Abstract

We show that asymptotically, completely asynchronous stochastic gradient procedures achieve
optimal (even to constant factors) convergence rates for the solution of convex optimization
problems under nearly the same conditions required for asymptotic optimality of standard
stochastic gradient procedures. Roughly, the noise inherent to the stochastic approximation
scheme dominates any noise from asynchrony. We also give empirical evidence of the strong
performance of asynchronous, parallel stochastic optimization schemes, demonstrating that the
robustness inherent to stochastic approximation problems allows substantially faster parallel
and asynchronous solution methods.


1 Introduction 

We study a natural asynchronous stochastic gradient method for the solution of minimization 
problems of the form 

$$ \text{minimize } f(x) := \mathbb{E}_P[F(x; W)] = \int_\Omega F(x; \omega)\, dP(\omega), \tag{1} $$

where $x \mapsto F(x; \omega)$ is convex for each $\omega \in \Omega$, $P$ is a probability
distribution on $\Omega$, and the vector $x \in \mathbb{R}^d$. Stochastic gradient techniques
for the solution of problem (1) have a long history in optimization, starting from the early
work of Robbins and Monro [29] and continuing on through Ermoliev [12], Polyak and Juditsky [26],
and Nemirovski et al. [23]. The latter two papers show how certain long stepsizes and averaging
techniques yield more robust and asymptotically optimal optimization schemes, and we show how
their results extend to practical parallel and asynchronous optimization settings.

We consider an extension of previous stochastic gradient methods to a natural family of
asynchronous gradient methods (see, e.g., the book of Bertsekas and Tsitsiklis [5]), where
multiple processors can draw samples from the distribution $P$ and asynchronously perform
updates to a centralized parameter vector $x$. Our iterative scheme is based on the Hogwild!
algorithm of Niu et al. [25], which is designed to asynchronously solve certain stochastic
optimization problems in multi-core environments, though our analysis and iterations are
different. In particular, we study the following procedure, where each processor runs
asynchronously and independently of the others, though they maintain a shared iteration
counter $k$; each processor performs the following:

(i) Processor reads current problem data x and counter k 

(ii) Processor draws a random sample $W \sim P$, computes $g = \nabla F(x; W)$, and increments
the centralized counter $k$

(iii) Processor updates $x \leftarrow x - \alpha_k g$ via sequential updates
$[x]_j = [x]_j - \alpha_k [g]_j$ for each coordinate $j \in \{1, \ldots, d\}$.

In the iterations (i)-(iii), the scalars $\alpha_k$ are a non-increasing stepsize sequence.
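The three steps can be mimicked with ordinary threads. The following is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function name `async_sgd`, its parameters, and the default stepsize $\alpha_k = 0.5\, k^{-0.6}$ are all illustrative. The shared vector is updated without any lock; only the counter is guarded, purely so that the stepsizes in this sketch are reproducible.

```python
import random
import threading


def async_sgd(grad, d, n_updates, n_workers=4, alpha=0.5, beta=0.6, seed=0):
    """Lock-free asynchronous SGD on a shared list x; returns the final x.

    grad(x, rng) must return a stochastic gradient of F(.; W) at x, drawing
    W with rng. All names and defaults are illustrative assumptions.
    """
    x = [0.0] * d                    # shared parameter vector, never locked
    counter = [0]                    # shared iteration counter k
    counter_lock = threading.Lock()  # guards the counter only (sketch choice)

    def worker(worker_id, budget):
        rng = random.Random(seed + worker_id)
        for _ in range(budget):
            snapshot = list(x)       # (i) read x; may mix old and new coordinates
            g = grad(snapshot, rng)  # (ii) draw W ~ P and compute g
            with counter_lock:
                counter[0] += 1
                k = counter[0]
            step = alpha * k ** (-beta)
            for j in range(d):       # (iii) coordinate-by-coordinate update
                x[j] -= step * g[j]

    budget = n_updates // n_workers
    threads = [threading.Thread(target=worker, args=(w, budget))
               for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x
```

As a usage example, taking $F(x; W) = \frac{1}{2}\|x - W\|^2$ with $W$ Gaussian around a mean $\mu$ gives `grad = lambda x, rng: [x[j] - rng.gauss(mu[j], 0.1) for j in range(len(x))]`, and the returned vector should land close to $\mu$ even though no worker waits for another.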

1.1 Main results and outline 

The thrust of our results is that because of the noise inherent to the sampling process for $W$,
the errors introduced by asynchrony in the iterations (i)-(iii) are asymptotically negligible:
they do not matter. Even more, we can efficiently construct an $x$ from the asynchronous process
possessing optimal convergence properties and asymptotic variance. This has consequences for
solving stochastic optimization problems on multi-core and multi-processor systems; we can
leverage parallel computing without performing any synchronization, so that given a machine with
$m$ processors, we can read data and perform updates $m$ times more quickly than what is possible
with a single processor, and the error from reading stale information on $x$ becomes
asymptotically negligible. In Section 2 we make these optimality claims formal, presenting our
main convergence theorems about the asynchronous iterations (i)-(iii) for solving the problem (1).
Our main result, Theorem 1, gives explicit conditions under which an asynchronous stochastic
gradient procedure converges at the optimal rate and with optimal asymptotic variance, and we give
applications to specific stochastic optimization problems in Section 2.2. We also provide a more
general result (Theorem 2) on the asynchronous solution of more general stochastic operator
equations, again demonstrating that asynchrony introduces asymptotically less noise than that
inherent in the stochastic problem itself. While we give explicit conditions under which our
results hold, we note that roughly all we require is a type of local strong convexity around the
optimal point $x^* = \operatorname{argmin}_x f(x)$, that the Hessian of $f$ be positive definite
near $x^*$, and a Lipschitz (smoothness) condition on the gradients $\nabla f(x)$.

In addition to theoretical results, in Section 3 we give empirical results on the power of
parallelism and asynchrony in the implementation of stochastic approximation procedures. Our
experiments demonstrate two results: first, even in non-asymptotic finite-sample settings,
asynchrony introduces little degradation in solution quality, regardless of data sparsity (a
common assumption in previous analyses); that is, asynchronously-constructed estimates are
statistically efficient. Second, we show that there is some subtlety in implementation of these
procedures in real hardware; while increases in parallelism lead to concomitant linear
improvements in the speed with which we compute solutions to problem (1), in some cases we require
strategies to reduce hardware resource competition between processors to achieve the full benefits
of asynchrony.

1.2 Related work 

Several researchers have provided and analyzed asynchronous algorithms for optimization. The
seminal work of Bertsekas and Tsitsiklis [5] provides a comprehensive study both of models of
asynchronous computation and analyses of asynchronous numerical algorithms, including coordinate-
and gradient-descent methods. The results of theirs relevant to our work are roughly of two types.
For non-stochastic problems, they show (roughly) linear convergence of iterative methods assuming
the iterations satisfy certain contractive properties, which roughly correspond to variants of
diagonal dominance of the Hessian of $f$ (see, for example, [5, Chapters 6.3 and 7.5]). For
stochastic problems [5, Chapter 7.8], they show results that have a similar flavor to ours: errors
due to asynchrony scale approximately quadratically in the stepsize $\alpha$, while gradient
information scales linearly with $\alpha$ so that it dominates other errors. Bertsekas and
Tsitsiklis use this error scaling to show that stepsize choices of the form $\alpha_k \approx 1/k$
guarantee asymptotic convergence under models of asynchrony with bounded delay. They leave open,
however, a few interesting questions, namely, attainable rates of convergence for asynchronous
stochastic procedures, the effects of unbounded delays, and what optimality guarantees are
possible relative to synchronous executions.

Due to their simplicity and dimension-independent convergence properties, stochastic and
non-stochastic gradient methods have become extremely popular for large-scale data analysis and
optimization problems (e.g., [23, 24, ?, ?]). Consequently, with the advent of multi-core
processing systems, there has been substantial work building on Bertsekas and Tsitsiklis's
results. Much of this work shows that asynchrony introduces negligible penalty in rates of
convergence for optimization procedures under suitable conditions, such as gradient sparsity,
conditioning of the Hessian of $f$, or allowable types of asynchrony (none, as we show, are
essential). Niu et al. [25] propose the Hogwild! method and show that under strong sparsity and
smoothness assumptions on the data (essentially, that the gradients $\nabla F(x; W)$ have a
vanishing fraction of non-zero entries, that $f$ is strongly convex, and $\nabla F(x; \omega)$ is
Lipschitz for all $\omega$), convergence guarantees similar to the synchronous case are possible.
Agarwal and Duchi [?] showed under restrictive ordering assumptions that some delayed gradient
calculations have negligible asymptotic effect. Duchi et al. [?] extended Niu et al.'s results to
a dual averaging algorithm that works for non-smooth, non-strongly-convex problems, again so long
as strong gradient sparsity assumptions hold (roughly, that the probability of an entry of
$\nabla F(x; \omega)$ being non-zero is inversely proportional to the maximum delay of any
processor) and delays are bounded. Researchers have also investigated parallel coordinate descent
solvers: Richtárik and Takáč [28] and Liu et al. [21] show how certain "separability" properties
of an objective function $f$, meaning the degree to which different coordinates of $x$ jointly
affect $f(x)$ (rather than $f(x)$ depending on each coordinate $x_j$ independently), govern the
convergence rate of parallel coordinate descent methods, the latter focusing on asynchronous
schemes. The conditions sufficient for fast or asynchronous convergence of coordinate methods are
similar to the diagonal dominance conditions used by Bertsekas and Tsitsiklis [5, Chapter 6.3.2].
Yet, as we show, large-scale stochastic optimization renders many of these problem assumptions
unnecessary. In particular, the asynchronous iterations (i)-(iii) retain all optimality properties
of synchronous (correct) gradient procedures, even in the face of nearly unbounded delays, and
enjoy optimal rates of convergence, even to constant pre-factors.


Notation We say a sequence of random variables or vectors $X_n$ converges in distribution to a
random variable $Z$, denoted $X_n \stackrel{d}{\to} Z$, if for all bounded continuous functions
$f$, $\mathbb{E}[f(X_n)] \to \mathbb{E}[f(Z)]$. We say that a sequence of (finite-dimensional)
random vectors $X_n$ converges in $L^p$ to a random vector $Z$, denoted
$X_n \stackrel{L^p}{\to} Z$, if there is a norm $\|\cdot\|$ such that
$\mathbb{E}[\|X_n - Z\|^p] \to 0$ as $n \to \infty$. This convergence is equivalent for any choice
of the norm $\|\cdot\|$. We let $X_n \stackrel{p}{\to} Z$ denote that $X_n$ converges in
probability to $Z$, meaning that $P(\|X_n - Z\| > \epsilon) \to 0$ as $n \to \infty$ for any
$\epsilon > 0$, and $X_n \stackrel{a.s.}{\to} c$ denotes almost sure convergence, meaning that
$P(\lim_n X_n \neq c) = 0$. The notation $N(\mu, \Sigma)$ denotes the multivariate Gaussian with
mean $\mu$ and covariance $\Sigma$. We let $I_{d \times d}$ denote the identity matrix in
$\mathbb{R}^{d \times d}$, using $I$ when the dimension is clear from context.

We use standard big-O notation. For (nonnegative) sequences $a_n$ and $b_n$, we let
$a_n \lesssim b_n$ mean there exists a constant $C < \infty$ such that $a_n \le C b_n$ for all
$n$, and $a_n \asymp b_n$ means that there exist constants $0 < c \le C < \infty$ such that
$c \le \liminf_n \frac{a_n}{b_n} \le \limsup_n \frac{a_n}{b_n} \le C$. For random vectors $X_n$,
$Z_n$, we say $X_n = O_P(Z_n)$ if for all $\epsilon > 0$, there exists $C < \infty$ such that
$\sup_n P(\|X_n\| > C\|Z_n\|) \le \epsilon$, while $X_n = o_P(Z_n)$ means that for all $c > 0$,
$\limsup_n P(\|X_n\| > c\|Z_n\|) = 0$.
2 Main results


Our main results rest on a few standard assumptions often used for the analysis of stochastic
optimization procedures, which we now detail, along with a few necessary definitions. We let $k$
denote the iteration counter used throughout the asynchronous gradient procedure. Given that
we compute $g = \nabla F(x; W)$ with counter value $k$ in the iterations (i)-(iii), we let $x_k$
denote the (possibly inconsistent) particular $x$ used to compute $g$, and likewise say that
$g = g_k$, noting that the update to $x$ is then performed using $\alpha_k$.

With the update (iii), we can give a more explicit formula for $x_k$ as a function of time $k$
with a small amount of additional notation. In particular, let $E^{ki} \in \{0, 1\}^{d \times d}$
be a diagonal matrix whose $j$th diagonal entry is 1 if the $j$th coordinate of the $i$th gradient
(i.e., that computed when the iteration counter is $i$) has been incorporated into iterate $x_k$
and is 0 otherwise. Then the iterations (i)-(iii) and index assignments imply that

$$ x_k = -\sum_{i=1}^{k-1} \alpha_i E^{ki} g_i. \tag{2} $$

With this definition of the update matrices $E^{ki}$, we then associate a delay value $M_k$ for
each $k$, defined by

$$ M_k := \min\{ l - k : E^{lk} = I_{d \times d},\ l > k \}, $$

or the amount of time required for all updates from the $k$th gradient to be incorporated into the
central $x$ vector. Rather than assuming a uniform bound on the delay, throughout, we make the
following assumption on the moments of the random variables $M_k$.

Assumption A. There exists $r > 2$ and a constant $M < \infty$ such that

$$ \sup_k \mathbb{E}[M_k^r]^{1/r} \le M. $$

Assumption A places our asynchronous iterations (i)-(iii) somewhere between Bertsekas and
Tsitsiklis's classification of totally asynchronous algorithms [5, Chapter 6], which require only
that each processor performs its updates eventually, and partially asynchronous algorithms
[5, Chapter 7], which specify a uniform bound on any processor's delay; roughly, we have a
quantitative version of total asynchrony. For example, if we know that processors have bounded
delays, we may take $r = \infty$ and assume that $M = \sup_k M_k < \infty$. In more general cases,
however, we can allow infrequent longer delays with, as we shall see, negligible effect on our
results except that the allowable stepsizes $\alpha_k$ are more restricted.
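To make the bookkeeping concrete, here is a small sketch of the matrices $E^{ki}$, the delays $M_k$, and the iterate formula (2). It assumes a hypothetical delay model in which each coordinate of each gradient is incorporated after an independent uniform delay in $\{1, \ldots, 5\}$ (so the delays are bounded and Assumption A holds trivially), and it uses the toy objective $F(x; W) = \frac{1}{2}\|x - W\|^2$. The function names are illustrative.

```python
import random


def incorporation_times(n, d, max_delay, rng):
    """times[i][j]: first counter value at which coordinate j of gradient i
    is visible in x (hypothetical uniform delay model)."""
    return {i: [i + rng.randint(1, max_delay) for _ in range(d)]
            for i in range(1, n + 1)}


def e_diag(times, k, i):
    """Diagonal of E^{ki}: 1 where gradient i is already incorporated in x_k."""
    return [1 if t <= k else 0 for t in times[i]]


def delay(times, i):
    """M_i: counter ticks until every coordinate of gradient i is incorporated."""
    return max(times[i]) - i


def iterate_from_equation_2(n, mu=(1.0, -1.0, 0.5), max_delay=5,
                            alpha=0.5, beta=0.6, noise=0.1, seed=0):
    """Build x_n directly from x_k = -sum_{i<k} alpha_i E^{ki} g_i."""
    rng = random.Random(seed)
    d = len(mu)
    times = incorporation_times(n, d, max_delay, rng)
    grads = {}
    x = [0.0] * d
    for k in range(1, n + 1):
        x = [0.0] * d
        for i in range(1, k):
            a_i = alpha * i ** (-beta)
            e = e_diag(times, k, i)
            for j in range(d):
                x[j] -= a_i * e[j] * grads[i][j]
        # g_k: gradient of ||x - W||^2 / 2 at the (stale, inconsistent) x_k
        grads[k] = [x[j] - rng.gauss(mu[j], noise) for j in range(d)]
    return x, times
```

Recomputing the sum in (2) at every step costs time quadratic in $n$; that is acceptable for an illustration but is not how one would run the method.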

2.1 Asynchronous convex optimization 

We now present our main theoretical results for solving the stochastic convex problem (1), giving
the necessary assumptions on $f$ and $F(\cdot; W)$ for our results. Our first assumption roughly
states that $f$ has a unique minimizer $x^*$, that $f$ has a quadratic expansion near the point
$x^*$, and is smooth (similar assumptions are common [e.g., 26, ?] and are satisfied in our
applications).

Assumption B. The function $f$ has unique minimizer $x^*$ and is twice continuously differentiable
in the neighborhood of $x^*$ with positive definite Hessian $H = \nabla^2 f(x^*) \succ 0$. There
is a covariance matrix $\Sigma \succ 0$ such that

$$ \mathbb{E}[\nabla F(x^*; W) \nabla F(x^*; W)^\top] = \Sigma. $$

Additionally, there exists a constant $C < \infty$ such that the gradients $\nabla F(x; W)$
satisfy

$$ \mathbb{E}[\|\nabla F(x; W) - \nabla F(x^*; W)\|^2] \le C \|x - x^*\|^2 \quad \text{for all } x \in \mathbb{R}^d. \tag{3} $$

Lastly, $f$ has $L$-Lipschitz continuous gradient, meaning
$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ for $x, y \in \mathbb{R}^d$.


Assumption B guarantees the uniqueness of the vector $x^*$ minimizing $f(x)$ over $\mathbb{R}^d$
and ensures that $f$ is well-behaved enough for our asynchronous iteration procedure to introduce
negligible noise over a non-asynchronous procedure. In addition to Assumption B, we make one of
two additional assumptions. In the first case, we assume that $f$ is strongly convex:

Assumption C. The function $f$ is $\lambda$-strongly convex over all of $\mathbb{R}^d$ for some
$\lambda > 0$, that is,

$$ f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\lambda}{2} \|x - y\|^2 \quad \text{for } x, y \in \mathbb{R}^d. \tag{4} $$

Our alternate assumption is a Lipschitz assumption on $f$ itself, made by virtue of a second
moment bound on $\nabla F(x; W)$.

Assumption C'. There exists a constant $G < \infty$ such that for all $x \in \mathbb{R}^d$,

$$ \mathbb{E}[\|\nabla F(x; W)\|^2] \le G^2. \tag{5} $$

In Section 2.2 to come, we give examples in which all of these assumptions are satisfied, showing
that they are not too restrictive.

With our assumptions in place, we obtain our main theorem. 

Theorem 1. Let Assumptions A with moment $r > 2$ and B hold. Let the iterates $x_k$ be generated
by the asynchronous process (i), (ii), (iii) with stepsize choice $\alpha_k = \alpha k^{-\beta}$,
where $\beta \in (\ldots, 1)$ and $\alpha > 0$. Then if either of Assumptions C or C' holds, we
have $x_n \stackrel{a.s.}{\to} x^*$ and

$$ \frac{1}{\sqrt{n}} \sum_{k=1}^n (x_k - x^*) \stackrel{d}{\to} N(0, H^{-1} \Sigma H^{-1}) = N\big(0, (\nabla^2 f(x^*))^{-1} \Sigma (\nabla^2 f(x^*))^{-1}\big). $$

Before moving to example applications of Theorem 1, we make a few additional remarks on the
theorem, its consequences, and its associated conditions. Let
$\bar{x}_n := \frac{1}{n} \sum_{k=1}^n x_k$ for shorthand. First, using the delta method
[e.g., 18, Theorem 1.8.12], we can give convergence rates for the function values $f(\bar{x}_n)$
to $f(x^*)$. Specifically, they converge at the optimal rate of $1/n$, and we can give explicit
constants.

Corollary 1. Let the conditions of Theorem 1 hold. Then

$$ n (f(\bar{x}_n) - f(x^*)) \stackrel{d}{\to} \frac{1}{2} \operatorname{tr}[H^{-1} \Sigma] \cdot \chi_1^2, $$

where $\chi_1^2$ denotes a chi-squared random variable with 1 degree of freedom, and
$H = \nabla^2 f(x^*)$ and $\Sigma = \mathbb{E}[\nabla F(x^*; W) \nabla F(x^*; W)^\top]$.
Proof. Theorem 1 implies $\bar{x}_n \stackrel{a.s.}{\to} x^*$. By a Taylor expansion, we have

$$ n (f(\bar{x}_n) - f(x^*)) = n \Big[ \langle \nabla f(x^*), \bar{x}_n - x^* \rangle + \frac{1}{2} \langle \bar{x}_n - x^*, \nabla^2 f(x^*) (\bar{x}_n - x^*) \rangle + E_3(\bar{x}_n - x^*) \Big], $$

where the error term $E_3$ satisfies $|E_3(x - x^*)| \le r(x) \|x - x^*\|^2$ for a function $r$
satisfying $r(x) \to 0$ as $x \to x^*$. As $\nabla f(x^*) = 0$, we see that for a remainder
$|r_n| \le r(\bar{x}_n) \cdot n \|\bar{x}_n - x^*\|^2 \stackrel{p}{\to} 0$, we have

$$ n (f(\bar{x}_n) - f(x^*)) = \frac{1}{2} \big\langle n^{1/2} (\bar{x}_n - x^*), \nabla^2 f(x^*)\, n^{1/2} (\bar{x}_n - x^*) \big\rangle + r_n. $$

By Theorem 1, the first term is asymptotically distributed as $\frac{1}{2} Z^\top H Z$ for
$Z \sim N(0, H^{-1} \Sigma H^{-1})$, and applying Slutsky's theorem [31, Theorem 2.7] and the
continuous mapping theorem gives the result. □


Moreover, the convergence guarantee in Theorem 1 is generally unimprovable even by numerical
constants. Recall that we have

$$ \sqrt{n} (\bar{x}_n - x^*) \stackrel{d}{\to} N(0, H^{-1} \Sigma H^{-1}). $$

Standard results in asymptotic statistics imply that the rate of convergence and covariance
$H^{-1} \Sigma H^{-1}$ are optimal. Indeed, the Le Cam-Hájek local minimax theorem [13] implies
that in standard statistical models, if we define the balls
$B(x, t) = \{x' \in \mathbb{R}^d : \|x - x'\| \le t\}$, then

$$ \lim_{t \to \infty} \liminf_{n \to \infty} \sup_F \Big\{ n \mathbb{E}[\|\hat{x}_n - x\|^2] : x = \operatorname{argmin}_{x'} \mathbb{E}[F(x'; W)] \in B(x^*, t/\sqrt{n}) \Big\} \ge \operatorname{tr}[H^{-1} \Sigma H^{-1}], $$

where the supremum is taken over loss functions $F$ satisfying Assumptions B and C or C', and
$\hat{x}_n$ is any sequence of estimators based on observing a sample $W_1, \ldots, W_n$. The
Le Cam-Hájek convolution theorem [17] and classical calculations with Bahadur efficiency (cf. van
der Vaart [31, Chapter 8]) also show that the asymptotic covariance is generally optimal, meaning
that no estimator can converge at a rate faster than $\sqrt{n}$, and the asymptotic covariance $S$
of essentially any estimator converging at the rate $\sqrt{n}$ must satisfy
$S \succeq H^{-1} \Sigma H^{-1}$.


More concisely, in spite of the asynchrony we allow in the iterations (i)-(iii), we attain the
best possible convergence rate. For the decision variables $x$, the rate $n^{-1/2}$ is
unimprovable, as is, at least generally, the asymptotic covariance $H^{-1} \Sigma H^{-1}$. We also
have $f(\bar{x}_n) - f(x^*) = O_P(n^{-1})$, which is information-theoretically optimal [22, 13].
So we see that quite literally, the noise inherent to the sampling in stochastic gradient
procedures swamps any noise introduced by asynchrony.
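A quick Monte Carlo experiment can illustrate the claim on a one-dimensional toy problem. The sketch below assumes $F(x; W) = \frac{1}{2}(x - W)^2$ with $W \sim N(0, 1)$, so that $x^* = 0$, $H = 1$, $\Sigma = 1$, and the limiting variance $H^{-1} \Sigma H^{-1}$ equals 1. Asynchrony is imitated crudely by evaluating each gradient at an iterate that is up to `max_delay` steps stale; this staleness model, the stepsize constants, and the function names are assumptions of the sketch rather than the paper's experimental setup.

```python
import random


def averaged_sgd(n, rng, alpha=0.5, beta=0.6, sigma=1.0, max_delay=0):
    """Return the average of x_1..x_n for SGD on F(x; W) = (x - W)^2 / 2.

    Each gradient is evaluated at an iterate that is between 0 and
    max_delay steps stale (a simple stand-in for asynchrony).
    """
    history = [0.0]                  # x_1 = 0, which is also x*
    total = 0.0
    for k in range(1, n + 1):
        x = history[-1]
        total += x
        lag = rng.randint(0, min(max_delay, k - 1))
        g = history[-1 - lag] - rng.gauss(0.0, sigma)  # stale gradient
        history.append(x - alpha * k ** (-beta) * g)
    return total / n


def scaled_second_moment(n, trials, max_delay, seed=0):
    """Monte Carlo estimate of n * E[(xbar_n - x*)^2]; the limit is 1 here."""
    rng = random.Random(seed)
    return n * sum(averaged_sgd(n, rng, max_delay=max_delay) ** 2
                   for _ in range(trials)) / trials
```

With, say, `n = 2000` and a few hundred trials, both `max_delay=0` and `max_delay=5` should give values of the same order as the predicted limit of 1, up to finite-sample and Monte Carlo error.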


2.2 Examples 

We now give two classical statistical optimization problems to illustrate Theorem 1. We verify
that the conditions of the theorem hold for each of the examples, and these show that the
conditions of Assumptions B and C or C' are not overly restrictive.

Linear regression Standard linear regression problems satisfy the conditions of Assumption C
under an additional fourth moment condition. In this case, the data
$\omega = (a, b) \in \mathbb{R}^d \times \mathbb{R}$ and the objective
$F(x; \omega) = \frac{1}{2} (\langle a, x \rangle - b)^2$. If we have moment bounds
$\mathbb{E}[\|a\|_2^4] < \infty$, $\mathbb{E}[b^2] < \infty$ and
$H = \mathbb{E}[a a^\top] \succ 0$, we have $\nabla^2 f(x^*) = H$, and

$$ \mathbb{E}[\|\nabla F(x; W) - \nabla F(x^*; W)\|_2^2] = \mathbb{E}[\|a\|_2^2 \langle a, x - x^* \rangle^2] \le \mathbb{E}[\|a\|_2^4] \, \|x - x^*\|_2^2, $$

whence the assumptions of Theorem 1 are satisfied. We can give more explicit calculations if we
make standard modeling assumptions, for example, that $b = \langle a, x^* \rangle + \varepsilon$,
where $\varepsilon$ is an independent mean-zero noise sequence with
$\mathbb{E}[\varepsilon^2] = \sigma^2$. In this case, the minimizer of
$f(x) = \mathbb{E}[F(x; W)]$ is $x^*$, we have $\langle a, x^* \rangle - b = -\varepsilon$, and

$$ \mathbb{E}[\nabla F(x^*; W) \nabla F(x^*; W)^\top] = \mathbb{E}[(\langle a, x^* \rangle - b)\, a a^\top (\langle a, x^* \rangle - b)] = \mathbb{E}[a a^\top \varepsilon^2] = \sigma^2 \mathbb{E}[a a^\top] = \sigma^2 H. $$

In particular, the asynchronous iterates satisfy

$$ n^{-1/2} \sum_{k=1}^n (x_k - x^*) \stackrel{d}{\to} N(0, \sigma^2 H^{-1}) = N\big(0, \sigma^2 \mathbb{E}[a a^\top]^{-1}\big). $$

This has the asymptotic variance of the ordinary least squares estimate of $x^*$, which is minimax
optimal [18, Chapter 5].


Logistic regression As long as the data has finite second moment, logistic regression problems
satisfy all the conditions of Assumption C' in Theorem 1. In this case we have
$\omega = (a, b) \in \mathbb{R}^d \times \{-1, 1\}$ and instantaneous objective
$F(x; \omega) = \log(1 + \exp(-b \langle a, x \rangle))$. For fixed $\omega$, this function is
Lipschitz continuous and has gradient and Hessian

$$ \nabla F(x; \omega) = -\frac{1}{1 + \exp(b \langle a, x \rangle)}\, b a \quad \text{and} \quad \nabla^2 F(x; \omega) = \frac{e^{b \langle a, x \rangle}}{(1 + e^{b \langle a, x \rangle})^2}\, a a^\top, $$

where $\nabla F(x; \omega)$ is Lipschitz continuous as
$\|\nabla^2 F(x; (a, b))\| \le \frac{1}{4} \|a\|_2^2$. Thus, so long as
$\mathbb{E}[\|a\|_2^2] < \infty$ and $\mathbb{E}[\nabla^2 F(x^*; W)] \succ 0$ (the latter occurs
if $\mathbb{E}[a a^\top]$ is full rank), logistic regression satisfies the conditions of
Theorem 1. In particular, the asynchronous stochastic gradient method achieves optimal convergence
guarantees.
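The gradient and Hessian formulas for the logistic objective are easy to check numerically. The sketch below writes the Hessian as a scalar weight times $a a^\top$ and uses the identity $e^m / (1 + e^m)^2 = p(1 - p)$ with $p = 1/(1 + e^{-m})$, which makes the bound by $\frac{1}{4}$ visible. The helper names are illustrative, and the code does not guard against overflow for very large margins.

```python
import math


def dot(u, v):
    """Inner product <u, v> of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def logistic_loss(x, a, b):
    """F(x; (a, b)) = log(1 + exp(-b <a, x>)) with label b in {-1, +1}."""
    return math.log1p(math.exp(-b * dot(a, x)))


def logistic_grad(x, a, b):
    """Gradient: -b a / (1 + exp(b <a, x>))."""
    weight = -b / (1.0 + math.exp(b * dot(a, x)))
    return [weight * ai for ai in a]


def hessian_weight(x, a, b):
    """Scalar s such that the Hessian equals s * a a^T; always at most 1/4."""
    p = 1.0 / (1.0 + math.exp(-b * dot(a, x)))
    return p * (1.0 - p)
```

Comparing `logistic_grad` against a central finite difference of `logistic_loss` is a simple sanity check, and `hessian_weight` attains its maximum $\frac{1}{4}$ at zero margin.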


2.3 Extension to nonlinear problems and variational inequalities 

We prove Theorem 1 by way of a more general result on finding the zeros of a residual operator
$R : \mathbb{R}^d \to \mathbb{R}^d$, where we only observe noisy views of $R(x)$, and there is a
unique $x^*$ such that $R(x^*) = 0$. Such situations arise, for example, in the solution of
stochastic monotone operator problems (cf. Juditsky, Nemirovski, and Tauvel [?] or Bertsekas and
Tsitsiklis [5]), including finding equilibria in stochastic convex Nash games (e.g.,
[?, Sec. 2.1], [5, Ex. 3.5.1(d)]), general saddle-point problems, or multi-user routing problems
[5, Ex. 3.5.1(c)]. In this more general setting, we consider the following stochastic and
asynchronous iterative process, which extends that for the convex case outlined previously. Each
processor performs the following asynchronously and independently:

(i) Processor reads current problem data x and counter k 

(ii) Processor receives vector $g = R(x) + \xi$, where $\xi$ is a random (conditionally)
mean-zero noise vector, and increments a centralized counter $k$

(iii) Processor updates $x \leftarrow x - \alpha_k g$ via sequential updates
$[x]_j = [x]_j - \alpha_k [g]_j$ for each coordinate $j \in \{1, \ldots, d\}$

As in the convex case, we associate vectors $x_k$ and $g_k$ with the update performed using
$\alpha_k$, and we let $\xi_k$ denote the noise vector used to construct $g_k$. As before, these
iterates and assignment of indices again imply that $x_k$ has the form (2), that is,
$x_k = -\sum_{i=1}^{k-1} \alpha_i E^{ki} g_i$ for diagonal matrices $E^{ki}$ capturing the updates
that have been performed at time $k$.
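For intuition about the operator setting, the following sketch runs the synchronous special case of this process (every $E^{ki} = I$) on a hypothetical saddle-point problem. The residual is $R(u, v) = (\partial_u L, -\partial_v L)$ for $L(u, v) = \frac{1}{2}(u - 1)^2 + (u - 1)(v - 2) - \frac{1}{2}(v - 2)^2$, whose unique zero is $(1, 2)$; this $R$ is not the gradient of any function. The example problem, the Gaussian noise, and all names are assumptions of the sketch.

```python
import random


def stochastic_approximation(residual, x0, n, alpha=0.5, beta=0.6,
                             noise=0.1, seed=0):
    """Iterate x <- x - alpha_k (R(x) + xi); return (last iterate, average)."""
    rng = random.Random(seed)
    x = list(x0)
    average = [0.0] * len(x)
    for k in range(1, n + 1):
        average = [m + xj / n for m, xj in zip(average, x)]
        g = [r + rng.gauss(0.0, noise) for r in residual(x)]  # g = R(x) + xi
        step = alpha * k ** (-beta)
        x = [xj - step * gj for xj, gj in zip(x, g)]
    return x, average


def saddle_residual(x):
    """R(u, v) = (dL/du, -dL/dv) for the toy saddle function L above."""
    u, v = x[0] - 1.0, x[1] - 2.0
    return [u + v, -u + v]
```

Here $\langle x - x^*, R(x) \rangle = \|x - x^*\|^2$, so with $V(x) = \frac{1}{2}\|x\|^2$ the residual satisfies a strong-monotonicity condition of the kind used below, and both the last iterate and the running average should approach $(1, 2)$.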

For this iterative process, we define the increasing sequence of $\sigma$-fields $\mathcal{F}_k$
by

$$ \mathcal{F}_k = \sigma\big(\xi_1, \ldots, \xi_k, \{E^{ij} : i \le k + 1,\ j \le i\}\big), \tag{6} $$

that is, the noise variables $\xi_k$ are adapted to the filtration $\mathcal{F}_k$, and these
$\sigma$-fields are the smallest containing both the noise and all index updates that have
occurred and that will occur to compute $x_{k+1}$. Thus we have $x_{k+1} \in \mathcal{F}_k$, and
our mean-zero assumption on the noise $\xi$ is that

$$ \mathbb{E}[\xi_k \mid \mathcal{F}_{k-1}] = 0. $$


Our analysis builds off of Polyak and Juditsky's study [26] of averaging in stochastic
approximation, and we model our requirements for the convergence of the preceding iteration on
those they use for the solution of the nonlinear equality $R(x^*) = 0$. First, we assume there is
a Lyapunov function $V$ that functions (essentially) as a squared norm, which satisfies
$V(x) \ge \lambda \|x\|^2$ for all $x \in \mathbb{R}^d$,
$\|\nabla V(x) - \nabla V(y)\| \le L \|x - y\|$ for all $x, y$, that $\nabla V(0) = 0$, and
$V(0) = 0$. Note in particular that this implies

$$ \lambda \|x\|^2 \le V(x) \le V(0) + \langle \nabla V(0), x - 0 \rangle + \frac{L}{2} \|x\|^2 = \frac{L}{2} \|x\|^2 \tag{7} $$

and that $\|\nabla V(x)\|^2 \le L^2 \|x\|^2 \le (L^2 / \lambda) V(x)$. In addition, we make the
following assumptions on the residual function (cf. [26, Assumption 3.2]).


Assumption D. There exists a matrix $H \in \mathbb{R}^{d \times d}$ with $H \succ 0$, a parameter
$0 < \gamma \le 1$, constant $C < \infty$, and some $\epsilon > 0$ such that if $x$ satisfies
$\|x - x^*\| \le \epsilon$, then

$$ \|R(x) - H(x - x^*)\| \le C \|x - x^*\|^{1 + \gamma}. $$

Assumption D essentially requires that $R$ is differentiable at $x^*$ with derivative matrix
$H \succ 0$. We also make a few assumptions on the noise process $\xi$ paralleling Assumption 3.3
of Polyak and Juditsky [26]; specifically, we assume $\xi$ implicitly depends on
$x \in \mathbb{R}^d$ (so that we may write $\xi_k = \xi(x_k)$), and that the following assumption
holds.

Assumption E. The noise vector $\xi(x)$ decomposes as $\xi(x) = \xi(0) + \zeta(x)$, where $\xi(0)$
is a process satisfying
$\mathbb{E}[\xi_k(0) \xi_k(0)^\top \mid \mathcal{F}_{k-1}] \stackrel{p}{\to} \Sigma \succ 0$ for
some matrix $\Sigma \in \mathbb{R}^{d \times d}$, the boundedness condition
$\mathbb{E}[\sup_k \mathbb{E}[\|\xi_k(0)\|^2 \mid \mathcal{F}_{k-1}]] < \infty$, and

$$ \mathbb{E}[\|\zeta_k(x)\|^2 \mid \mathcal{F}_{k-1}] \le C \|x - x^*\|^2 $$

for some constant $C < \infty$ and all $x \in \mathbb{R}^d$.


As in the convex case, we make one of two additional assumptions, which should be compared
with Assumptions C and C'. The first is that $R$ gives globally strong information about $x^*$.

Assumption F (Strongly convex residuals). There exists a constant $\lambda_0 > 0$ such that for
all $x \in \mathbb{R}^d$,
$\langle \nabla V(x - x^*), R(x) \rangle \ge \lambda_0 V(x - x^*)$.

Alternatively, we may make an assumption on the boundedness of $R$, which we shall see suffices
for proving our main results.

Assumption F' (Bounded residuals). There exist $\lambda_0 > 0$ and $\epsilon > 0$ such that

$$ \inf_{0 < \|x - x^*\| \le \epsilon} \frac{\langle \nabla V(x - x^*), R(x) \rangle}{V(x - x^*)} \ge \lambda_0 \quad \text{and} \quad \inf_{\epsilon < \|x - x^*\|} \langle \nabla V(x - x^*), R(x) \rangle > 0. $$

There also exists some $C < \infty$ such that $\|R(x)\| \le C$ and
$\mathbb{E}[\|\xi_k\|^2 \mid \mathcal{F}_{k-1}] \le C^2$ for all $k$ and $x$.

With these assumptions in place, we obtain the following more general version of Theorem 1;
indeed, we show in the sequel how Theorem 1 follows from this result.

Theorem 2. Let $V$ be a function satisfying inequality (7), and let Assumptions A, D, and E hold.
Let the stepsizes $\alpha_k = \alpha k^{-\beta}$, where $\ldots < \beta < 1$. Let one of
Assumptions F or F' hold. Then $x_n \stackrel{a.s.}{\to} x^*$ and

$$ \frac{1}{\sqrt{n}} \sum_{k=1}^n (x_k - x^*) \stackrel{d}{\to} N(0, H^{-1} \Sigma H^{-1}). $$

We may compare this result to Polyak and Juditsky's Theorem 2 [26], which gives identical
covariance matrix and asymptotic convergence guarantees, but with weaker conditions on the
function $V$ and stepsize sequence $\alpha_k$. Our mildly stronger assumptions (namely,
Assumptions F and F' are stronger versions of Assumption 3.1 of Polyak and Juditsky [26], which
requires only the conditions on $V$ and $R$ of Assumption F') allow our result to apply even in
the asynchronous settings considered in this paper.


3 Experimental results 


We provide empirical results studying the performance of asynchronous stochastic approximation
schemes on several simulated and real-world datasets. Our theoretical results suggest that
asynchrony should introduce little degradation in solution quality; we also investigate the
engineering techniques necessary to truly leverage the power of asynchronous stochastic
procedures. In our experiments, we focus on linear and logistic regression, the examples given in
Section 2.2; that is, we have data $(a_i, b_i) \in \mathbb{R}^d \times \mathbb{R}$ (for linear
regression) or $(a_i, b_i) \in \mathbb{R}^d \times \{-1, 1\}$ (for logistic regression), for
$i = 1, \ldots, N$, and objectives

$$ f(x) = \frac{1}{2N} \sum_{i=1}^N (\langle a_i, x \rangle - b_i)^2 \quad \text{and} \quad f(x) = \frac{1}{N} \sum_{i=1}^N \log\big(1 + \exp(-b_i \langle a_i, x \rangle)\big). \tag{8} $$


We perform each of our experiments using a 48-core Intel Xeon machine with 1 terabyte of
RAM, and have put code and binaries to replicate our experiments on CodaLab [?]. The Xeon
architecture puts each core onto one of four sockets, where each socket has its own memory. To
limit the impact of communication overhead in our experiments, we limit all experiments to at most
12 cores, all on the same socket. Within an experiment, based on the empirical expectations (8),
we iterate in epochs, so that our stochastic gradient procedure loops through random permutations
of all examples, touching each example exactly once per epoch in a different random order within
each epoch (cf. [?]).¹ We use the following two schemes for the stepsize $\alpha_k$.

Decreasing stepsizes. We set $\beta = 0.55$ and let $\alpha_k = k^{-\beta}$. The value written on
the shared iteration counter $k$ by one processor may be overwritten by other processors.


Exponential backoff stepsizes. We use a fixed stepsize $\alpha$, decreasing the stepsize by a
factor of 0.95 between each epoch (this matches the experimental protocol of Niu et al. [25] and
follows Hazan and Kale [14] and Ghadimi and Lan [13]).
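The two schedules are simple enough to state as code. This is a sketch with illustrative names; the initial stepsize `alpha0` and the convention that epochs are indexed from 0 are assumptions, since the text does not fix them.

```python
def decreasing_stepsize(k, beta=0.55):
    """alpha_k = k^(-beta), where k >= 1 is the shared iteration counter."""
    return k ** (-beta)


def backoff_stepsize(epoch, alpha0, factor=0.95):
    """Stepsize held fixed within an epoch and multiplied by `factor`
    between epochs. Epochs are indexed from 0 here (an assumption)."""
    return alpha0 * factor ** epoch
```

The first schedule depends on the per-update counter, so it is sensitive to the counter being overwritten by other processors, whereas the second changes only at epoch boundaries.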


To address issues of hardware resource contention (see Section 3.2 for more on this), in some
cases we use a mini-batching strategy. Abstractly, in the formulation of the basic problem (1),
this means that in each calculation of a stochastic gradient $g$ we draw $B \ge 1$ samples
$W_1, \ldots, W_B$ i.i.d. according to $P$, then set

$$ g(x) = \frac{1}{B} \sum_{b=1}^B \nabla F(x; W_b). \tag{9} $$

The mini-batching strategy (9) does not change the (asymptotic) convergence guarantees of
asynchronous stochastic gradient descent, as the covariance matrix
$\Sigma = \mathbb{E}[g(x^*) g(x^*)^\top]$ satisfies
$\Sigma = \frac{1}{B} \mathbb{E}[\nabla F(x^*; W) \nabla F(x^*; W)^\top]$, while the total
iteration count is reduced by a factor of $B$. Lastly, we measure the performance of optimization
schemes via speedup, defined as

$$ \text{speedup} = \frac{\text{average epoch runtime on a single core using stochastic gradient descent}}{\text{average epoch runtime of asynchronous method on } m \text{ cores}}. \tag{10} $$

In our experiments, we see that increasing the number $m$ of cores does not change the gap in
optimality $f(x_k) - f(x^*)$ after each epoch, so speedup is equivalent to the ratio of the time
required to obtain an $\epsilon$-accurate solution using a single processor/core to that required
to obtain an $\epsilon$-accurate solution using $m$ processors/cores.
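The mini-batch gradient (9) and the speedup ratio (10) can be sketched as follows. The callables `grad` and `draw_sample` and the timing lists are hypothetical inputs introduced for illustration; nothing here measures time by itself.

```python
def minibatch_gradient(grad, x, draw_sample, batch_size, rng):
    """Average of batch_size stochastic gradients at x, as in equation (9).

    grad(x, w) returns the gradient of F(.; w) at x, and draw_sample(rng)
    returns one sample W ~ P; both are supplied by the caller.
    """
    total = [0.0] * len(x)
    for _ in range(batch_size):
        g = grad(x, draw_sample(rng))
        total = [t + gj for t, gj in zip(total, g)]
    return [t / batch_size for t in total]


def speedup(single_core_epoch_times, multi_core_epoch_times):
    """Ratio of average epoch runtimes, as in equation (10)."""
    def mean(values):
        return sum(values) / len(values)
    return mean(single_core_epoch_times) / mean(multi_core_epoch_times)
```

For instance, epoch times of 10 and 14 seconds on one core against 2 and 4 seconds on several cores give a speedup of 4.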


3.1 Efficiency and sparsity 

For our first set of experiments, we study the effect that data sparsity has on the convergence
behavior of asynchronous methods using the linear regression objective (8). Sparsity has been an
essential part of the analysis of many asynchronous and parallel optimization schemes [?, ?, ?],
while our theoretical results suggest it should be unimportant, so understanding these effects is
important. We generate synthetic linear regression problems with $N = 10^6$ examples in
$d = 10^3$ dimensions via the following procedure. Let $p_{nz} \in (0, 1]$ be the desired fraction
of non-zero gradient entries, and let $\Pi_{p_{nz}}$ be a random projection operator that zeros
out all but a fraction $p_{nz}$ of the elements of its argument, meaning that for
$a \in \mathbb{R}^d$, $\Pi_{p_{nz}}(a)$ uniformly at random chooses $p_{nz} d$ elements of $a$ and
leaves them identical, zeroing the remaining elements. We generate data for our linear regression
by drawing a random vector $u^* \sim N(0, I_{d \times d})$, then constructing

$$ b_i = \langle a_i, u^* \rangle + \varepsilon_i, \quad \text{where } \varepsilon_i \stackrel{\text{iid}}{\sim} N(0, 1),\ \tilde{a}_i \stackrel{\text{iid}}{\sim} N(0, I_{d \times d}),\ \text{and } a_i = \Pi_{p_{nz}}(\tilde{a}_i) \tag{11} $$

for $i = 1, \ldots, N$, where $\Pi_{p_{nz}}(\tilde{a}_i)$ denotes an independent random sparse
projection of $\tilde{a}_i$. To measure optimality gap, we directly compute
$x^* = (A^\top A)^{-1} A^\top b$, where $A = [a_1 \cdots a_N]^\top \in \mathbb{R}^{N \times d}$.
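The data-generating procedure (11) translates directly into code. The sketch below uses only the standard library and illustrative function names; rounding $p_{nz} d$ to the nearest integer and keeping at least one entry are assumptions made so that the projection is well defined for any inputs.

```python
import random


def sparse_projection(a, p_nz, rng):
    """Keep round(p_nz * d) uniformly chosen entries of a and zero the rest."""
    d = len(a)
    n_keep = max(1, round(p_nz * d))   # at least one entry (sketch choice)
    keep = set(rng.sample(range(d), n_keep))
    return [a[j] if j in keep else 0.0 for j in range(d)]


def generate_regression_data(n, d, p_nz, seed=0):
    """Draw (a_i, b_i) as in (11): b_i = <a_i, u*> + eps_i with sparse a_i."""
    rng = random.Random(seed)
    u_star = [rng.gauss(0.0, 1.0) for _ in range(d)]
    rows, targets = [], []
    for _ in range(n):
        dense = [rng.gauss(0.0, 1.0) for _ in range(d)]
        a = sparse_projection(dense, p_nz, rng)   # independent per example
        noise = rng.gauss(0.0, 1.0)
        rows.append(a)
        targets.append(sum(aj * uj for aj, uj in zip(a, u_star)) + noise)
    return rows, targets, u_star
```

At the scale used in the experiments ($N = 10^6$, $d = 10^3$) one would want a sparse array representation instead of dense Python lists; the version above is only meant to pin down the sampling scheme.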
[Figure 1 panels: (a) $p_{nz} = .005$, (b) $p_{nz} = .01$, (c) $p_{nz} = .2$, (d) $p_{nz} = 1$]

Figure 1. Decreasing stepsizes: Optimality gaps for synthetic linear regression experiments
showing effects of data sparsity and asynchrony on $f(x_k) - f(x^*)$ with stepsize
$\alpha_k = k^{-\beta}$. A fraction $p_{nz}$ of each vector $a_i \in \mathbb{R}^d$ is non-zero.

[Figure 2 panels: (a) $p_{nz} = .005$, (b) $p_{nz} = .01$, (c) $p_{nz} = .2$, (d) $p_{nz} = 1$;
legend: 10 cores, 8 cores, 4 cores, 1 core]

Figure 2. Exponentially decreasing stepsizes: Optimality gaps for synthetic linear regression
experiments showing effects of data sparsity and asynchrony on $f(x_k) - f(x^*)$ with epoch-based
stepsizes $\alpha_{\text{epoch } k} = .95^k$. A fraction $p_{nz}$ of each vector
$a_i \in \mathbb{R}^d$ is non-zero.


In Figures 1 and 2 we plot the results of simulations using densities $p_{nz} \in \{.005, .01, .2, 1\}$ and
mini-batch size $B = 10$, showing the gap $f(x_k) - f(x^*)$ as a function of the number of epochs for each
of the given sparsity levels. Figure 1 gives results for our simulated data (11) using the decreasing
stepsize scheme $\alpha_k = \alpha k^{-\beta}$ with $\beta = .55$, while Figure 2 gives results using the exponential backoff
scheme of [14, 13, 25], where stepsizes are chosen per epoch as $\alpha_{{\rm epoch}\ k} = .95^k$. Each plot includes
error bars with standard errors over 10 random experiments using different random seeds (the
errors are generally too small to see in the plots). We give results using 1, 4, 8, and 10 processor
cores (increasing degrees of asynchrony). From the plots, we see that regardless of the number
of cores, the convergence behavior is nearly identical, with minor degradations in performance for
the sparsest data. (We plot the gaps $f(x_k) - f(x^*)$ on a logarithmic axis.) Moreover, as the data
becomes denser, the more asynchronous methods (those using larger numbers of cores) achieve performance
essentially identical to the fully synchronous method in terms of convergence versus number of
epochs.
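The two stepsize schemes compared in these plots can be written as small Python functions. This is a minimal sketch: the function names, the default leading constant `alpha = 1.0`, and the convention that epochs are counted from 0 are assumptions for illustration, not values reported in the text.

```python
def polynomial_stepsize(k, alpha=1.0, beta=0.55):
    """Decreasing scheme: alpha_k = alpha * k**(-beta) for iteration k >= 1."""
    return alpha * k ** (-beta)


def epoch_stepsize(epoch, alpha=1.0, decay=0.95):
    """Epoch-based scheme: constant within an epoch, multiplied by `decay` between epochs."""
    return alpha * decay ** epoch
```

The first schedule changes at every iteration, while the second is held fixed for a full pass over the data and shrinks geometrically from one epoch to the next.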

1 Strictly speaking, this violates the stochastic gradient assumption, but it allows direct comparison with the
original Hogwild! code and implementation [25].

[Figure 3 panels: (a) $p_{nz} = .005$, (b) $p_{nz} = .01$, (c) $p_{nz} = .2$, (d) $p_{nz} = 1$.]

Figure 3. Decreasing stepsizes: Speedups for synthetic linear regression experiments showing effects
of data sparsity on speedup (10) with stepsize $\alpha_k = \alpha k^{-\beta}$. A fraction $p_{nz}$ of each vector $a_i \in \mathbb{R}^d$ is
non-zero.

Figure 4. Exponentially decreasing stepsizes: Speedups for synthetic linear regression experiments
showing effects of data sparsity on speedup (10) with epoch-based stepsizes $\alpha_{{\rm epoch}\ k} = .95^k$. A
fraction $p_{nz}$ of each vector $a_i \in \mathbb{R}^d$ is non-zero.

In Figures 3 and 4 we plot the speedup achieved for the synthetic regression problem with
data (11) using different numbers of cores for the experiments in Figures 1 and 2 (as before, Fig. 3
uses stepsizes $\alpha_k = \alpha k^{-\beta}$ and Fig. 4 uses stepsizes exponentially decreasing between epochs). As
a point of comparison, we also implement a synchronized method, which uses multiple cores to
compute gradients $g^i$ independently on each core $i = 1, \ldots, c$, then computes the average
$g = \frac{1}{c} \sum_{i=1}^c g^i$ and uses that to perform a standard stochastic gradient update; this requires explicit
synchronization (locking) of the updates, though with sufficiently large batch sizes $B$ computation
may theoretically overwhelm the communication and locking overhead [?]. For comparison, we use
the same batch size $B = 10$ for each of the synchronous and asynchronous procedures, and we
see that the performance of the naive locking strategy is worse than that of the asynchronous
gradient method across all data densities $p_{nz}$. At higher densities, more computation is necessary
within each gradient computation, so that the communication overhead causes less performance
degradation for the synchronous method, yet the asynchronous method also benefits and attains
better relative performance. We see clearly that data sparsity is not necessary for the asynchronous
gradient method to enjoy substantial performance benefits.
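The arithmetic of the synchronized baseline (average the per-core gradients, then take one ordinary gradient step) can be sketched as follows. The sketch shows only the update rule, under the assumption that the per-core gradients have already been gathered; it does not model the locking or communication itself, and the function name is hypothetical.

```python
def synchronized_step(x, core_gradients, stepsize):
    """One synchronized update: average the c per-core gradients, then step."""
    c = len(core_gradients)
    averaged = [sum(g[j] for g in core_gradients) / c for j in range(len(x))]
    return [xj - stepsize * gj for xj, gj in zip(x, averaged)]
```

Every core must finish its gradient before the average can be formed, which is the source of the synchronization overhead discussed above.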
No batching (B = 1)

Number of cores          1                4                8                10
fraction of L1 misses    0.0021 ± 0.0001  0.0061 ± 0.0001  0.0096 ± 0.0001  0.0102 ± 0.0001
fraction of L2 misses    0.50 ± 0.01      0.63 ± 0.01      0.76 ± 0.01      0.78 ± 0.01
fraction of L3 misses    0.41 ± 0.01      0.25 ± 0.01      0.24 ± 0.01      0.25 ± 0.01
epoch average time (s)   4.55             1.85             1.61             1.47
speedup                  1.00             2.46 ± 0.01      2.83 ± 0.01      3.09 ± 0.01

Batch size B = 10

Number of cores          1                4                8                10
fraction of L1 misses    0.0027 ± 0.0002  0.0033 ± 0.0001  0.0043 ± 0.0001  0.0046 ± 0.0001
fraction of L2 misses    0.44 ± 0.01      0.50 ± 0.01      0.60 ± 0.01      0.63 ± 0.01
fraction of L3 misses    0.35 ± 0.03      0.33 ± 0.01      0.33 ± 0.01      0.33 ± 0.01
epoch average time (s)   2.97             0.87             0.58             0.51
speedup                  1.00             3.42 ± 0.01      5.16 ± 0.02      5.80 ± 0.03

Table 1. Memory traffic for batched updates (9) versus non-batched updates (B = 1) for a dense
linear regression problem in $d = 10^3$ dimensions with a sample of size $N = 10^6$. Cache misses are
substantially higher with B = 1.
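As a rough consistency check on Table 1, the speedup rows can be compared with the ratio of single-core to multi-core epoch times. The sketch below assumes that speedup is (approximately) this ratio; the exact definition (10) appears outside this section, and the reported values seem to be averaged per experiment, so small discrepancies are expected.

```python
# Epoch average times (seconds) copied from Table 1, for 1, 4, 8, and 10 cores.
EPOCH_TIME = {
    1: [4.55, 1.85, 1.61, 1.47],   # no batching, B = 1
    10: [2.97, 0.87, 0.58, 0.51],  # batch size B = 10
}


def speedups(times):
    """Ratio of the single-core epoch time to each multi-core epoch time."""
    return [times[0] / t for t in times]


for batch_size, times in EPOCH_TIME.items():
    print(batch_size, [round(s, 2) for s in speedups(times)])
```

The ratios agree with the tabulated speedups to within a few hundredths, and they make the gap between the two batch sizes at 10 cores easy to see.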


3.2 Hardware issues and cache locality 

We now detail a set of experiments investigating hardware issues that arise even in the implementation
of asynchronous gradient methods. The Intel x86 architecture (as with essentially every
processor architecture) organizes memory in a hierarchy, going from level 1 to level 3 (L1 to L3)
caches of increasing sizes. An important aspect of the speed of different optimization schemes is
the relative fraction of memory hits, meaning accesses to memory that is cached locally (in order
of decreasing speed, L1, L2, or L3 cache). In Table 1 we show the proportion of cache misses at
each level of the memory hierarchy for our synthetic regression experiment with fully dense data
($p_{nz} = 1$) over the execution of 20 epochs, averaged over 10 different experiments. We compare
memory contention when the batch size $B$ used to compute the local asynchronous gradients (9) is
1 and 10. We see that the proportion of misses for the fastest two levels (1 and 2) of the cache
for $B = 1$ increases significantly with the number of cores, while increasing the batch size to $B = 10$
substantially mitigates cache incoherence. In particular, we maintain (near) linear increases in
iteration speed with little degradation in solution quality (the gap $f(x) - f(x^*)$ output by each of
the procedures with and without batching is identical to within $10^{-3}$; cf. Figure 2(d)).


3.3 Real datasets 


We perform experiments using three different real-world datasets: the Reuters RCV1 corpus [19],
the Higgs detection dataset [?], and the Forest Cover dataset [ 2 d]. Each represents a binary
classification problem, which we formulate using logistic regression (recall Sec. 2.2). We briefly detail
relevant statistics of each:

(1) The Reuters RCV1 dataset consists of $N \approx 7.81 \cdot 10^5$ data vectors (documents) $a_i \in \{0, 1\}^d$
with $d \approx 5 \cdot 10^4$ dimensions; each vector has sparsity approximately $p_{nz} = 3 \cdot 10^{-3}$. Our task is
to classify each document as being about corporate industrial topics (CCAT) or not.

(2) The Higgs detection dataset consists of $N = 10^6$ data vectors $\tilde{a}_i \in \mathbb{R}^{d_0}$, with $d_0 = 28$. We
quantize each coordinate into 5 bins containing equal fractions of the coordinate values and
encode each vector $\tilde{a}_i$ as a vector $a_i \in \{0, 1\}^d$ with $d = 5 d_0$ whose non-zero entries correspond
to the quantiles into which coordinates fall. The task is to detect (simulated) emissions from a
linear accelerator.

(3) The Forest Cover dataset consists of $N \approx 5.7 \cdot 10^5$ data vectors $a_i \in \{-1, 1\}^d$ with $d = 54$, and
the task is to predict forest growth types.

Thus, each dataset gives a different flavor of optimization problem: the first is very sparse and
high-dimensional, the second is somewhat sparse and of moderate dimension, while the forest dataset is
dense but of relatively small dimension. These allow a broader picture of the performance of the
asynchronous gradient method.

[Figure 5 panels: (a) RCV1 ($p_{nz} = .003$), (b) Higgs ($p_{nz} = 1$), (c) Forest ($p_{nz} = 1$).]

Figure 5. Decreasing stepsizes: Optimality gaps $f(x_k) - f(x^*)$ on the (a) RCV1, (b) Higgs, and (c)
Forest Cover datasets using stepsize $\alpha_k = \alpha k^{-\beta}$ with $\beta = .55$.
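The quantile encoding described for the Higgs data (5 equal-frequency bins per coordinate, one indicator per bin) can be sketched as below. The function names and the particular rule for choosing cut points are assumptions for illustration; the preprocessing actually used may break ties or place bin edges differently.

```python
from bisect import bisect_right


def quantile_edges(values, n_bins=5):
    """Return n_bins - 1 cut points splitting the values into near-equal-count bins."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(j * n) // n_bins] for j in range(1, n_bins)]


def one_hot_quantize(rows, n_bins=5):
    """Encode each d0-dimensional row as a {0,1} vector of length n_bins * d0."""
    d0 = len(rows[0])
    edges = [quantile_edges([r[j] for r in rows], n_bins) for j in range(d0)]
    encoded = []
    for r in rows:
        vec = [0] * (n_bins * d0)
        for j, value in enumerate(r):
            bin_index = bisect_right(edges[j], value)  # which quantile bin the value falls in
            vec[j * n_bins + bin_index] = 1
        encoded.append(vec)
    return encoded
```

Each encoded vector has exactly one non-zero entry per original coordinate, so with $d_0 = 28$ the result has $d = 140$ dimensions, 28 of which are non-zero.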

We follow the same experimental protocol as in our simulated data experiments. That is, we
perform 10 experiments for each dataset using 1, 4, 8, and 10 cores, where each experiment consists
of running the asynchronous gradient method for 20 epochs, within each of which examples are
accessed according to a new random permutation. We use a batch size $B = 10$ for each experiment,
and collect standard errors for the (estimated) optimality gaps. In these experiments, as a proxy
for the optimal value $f(x^*)$ we run a synchronous gradient method for 100 epochs, using its best
objective value as $f(x^*)$. In Figures 5 and 6 we plot the gap $f(x_k) - f(x^*)$ as a function of epochs,
giving standard error intervals, for each of the three datasets. The figures show there is essentially
no degradation in objective value when using different numbers of processors; that is, asynchrony
appears to have negligible effect for each problem, whether we use the stepsizes $\alpha_k = \alpha k^{-\beta}$ (with $\beta = .55$)
that we analyze or the epoch-based exponentially decreasing stepsize scheme used by Niu et al.
[25], also analyzed by Hazan and Kale [14] and Ghadimi and Lan [13]. In Figures 7 and 8 we
plot speedup achieved for these same experiments. The asynchronous gradient method
achieves nearly linear speedup of between 6x and 8x on each of the datasets using 10 cores.
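The data-access pattern in this protocol (a fresh random permutation each epoch, consumed in mini-batches of size $B$) can be sketched as follows. The function name and the handling of a final short batch are assumptions made for illustration.

```python
import random


def epoch_minibatches(n_examples, batch_size, rng):
    """Return the index batches for one epoch under a new random permutation."""
    order = list(range(n_examples))
    rng.shuffle(order)  # new permutation for this epoch
    return [order[i:i + batch_size] for i in range(0, n_examples, batch_size)]


rng = random.Random(0)
for epoch in range(2):  # the experiments above run 20 epochs
    batches = epoch_minibatches(25, 10, rng)
    print(epoch, [len(b) for b in batches])
```

Every example is visited exactly once per epoch, which is sampling without replacement rather than the i.i.d. sampling assumed by the analysis.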


[Figure 6 panels: (a) RCV1 ($p_{nz} = .003$), (b) Higgs, (c) Forest ($p_{nz} = 1$); horizontal axis: epochs
(0 to 20); vertical axis: $f(x_k) - f(x^*)$; curves for 1, 4, 8, and 10 cores.]

Figure 6. Exponentially decreasing stepsizes: Optimality gaps $f(x_k) - f(x^*)$ on the (a) RCV1, (b)
Higgs, and (c) Forest Cover datasets with epoch-based stepsizes $\alpha_{{\rm epoch}\ k} = .95^k$.


4 Proofs 

In this section, we present proofs of our two main theorems, deferring the proofs of technical lemmas 
to subsequent appendices. 

4.1 Proof of Theorem 1

We prove Theorem 1 by a reduction to Theorem 2. For both settings of Theorem 1, we will use
$V(x) = \frac{1}{2} \|x\|^2$ and $R(x) = \nabla f(x)$, then apply Theorem 2. With these choices, Assumption D
is satisfied with the Hessian $H = \nabla^2 f(x^*) \succ 0$ and $\gamma = 1$ by a Taylor expansion of $f$, which is
assumed continuously twice differentiable near $x^*$.

Let us also verify that Assumption E holds for an appropriate noise sequence in the stochastic
convex optimization setting. Throughout, we also use the sequence of $\sigma$-fields $\mathcal{F}_k$ defined by
expression (6), so that $x_k \in \mathcal{F}_{k-1}$, and the counter $k$ implicitly gives a gradient $g_k$, point $x_k$,
sample $W_k$, and noise error

$$\xi_k = g_k - R(x_k) = \nabla F(x_k; W_k) - \nabla f(x_k). \tag{12}$$

That is, we have $g_k = R(x_k) + \xi_k$ as in the nonlinear setting of Theorem 2. We first verify that $\xi_k$
is a martingale difference sequence: we have $E[\xi_k \mid \mathcal{F}_{k-1}] = \nabla f(x_k) - \nabla f(x_k) = 0$ because $W_k$ is
independent of $x_k$. We also have the decomposition

$$\xi_k = \underbrace{\nabla F(x^*; W_k)}_{\xi_k(0)} + \underbrace{\nabla F(x_k; W_k) - \nabla F(x^*; W_k) - \nabla f(x_k)}_{\zeta_k(x_k)},$$

and that $\|\nabla f(x_k)\| = \|\nabla f(x_k) - \nabla f(x^*)\| \le L \|x_k - x^*\|$, so

$$E[\|\zeta_k(x_k)\|^2 \mid \mathcal{F}_{k-1}] \le E[\|\nabla F(x_k; W_k) - \nabla F(x^*; W_k)\|^2 \mid \mathcal{F}_{k-1}] + \|\nabla f(x_k)\|^2 \le C \|x_k - x^*\|^2$$

by inequality (3). Moreover, we have

$$E[\xi_k(0) \xi_k(0)^T \mid \mathcal{F}_{k-1}] = E[\nabla F(x^*; W_k) \nabla F(x^*; W_k)^T \mid \mathcal{F}_{k-1}] = \Sigma,$$

as the random variable $W_k$ is independent of $\mathcal{F}_{k-1}$. That is, Assumption E holds.

[Figure 7 panels: (a) RCV1 ($p_{nz} = .003$), (b) Higgs ($p_{nz} = 1$), (c) Forest ($p_{nz} = 1$).]

Figure 7. (Decreasing stepsizes) Logistic regression experiments showing speedup (10) on the (a)
RCV1, (b) Higgs, and (c) Forest Cover datasets.

Now we show that Assumption F holds whenever Assumption C holds. If $f$ is strongly convex,
then taking $R(x) = \nabla f(x)$ and $V(x) = \frac{1}{2} \|x\|^2$ we have for any $x, y \in \mathbb{R}^d$ that

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\lambda}{2} \|x - y\|^2 \quad \text{and} \quad f(x) \ge f(y) + \langle \nabla f(y), x - y \rangle + \frac{\lambda}{2} \|x - y\|^2.$$

Taking $y = x^*$ in the preceding expression while noting that $\nabla f(x^*) = \nabla f(y) = 0$, we have

$$\langle \nabla V(x - x^*), R(x) \rangle = \langle x - x^*, \nabla f(x) \rangle = \langle \nabla f(x) - \nabla f(x^*), x - x^* \rangle \ge \lambda \|x - x^*\|^2 = 2 \lambda V(x - x^*).$$

Clearly, $V$ has 1-Lipschitz gradient and satisfies $V(x - x^*) \ge \frac{1}{2} \|x - x^*\|^2$.
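For completeness, the inequality $\langle \nabla f(x) - \nabla f(x^*), x - x^* \rangle \ge \lambda \|x - x^*\|^2$ used above follows by adding the two strong convexity inequalities. A short derivation in the same notation, written for general $x, y$:

```latex
\begin{align*}
f(y) + f(x)
  &\ge f(x) + f(y) + \langle \nabla f(x), y - x \rangle
     + \langle \nabla f(y), x - y \rangle + \lambda \|x - y\|^2 \\
\Longrightarrow \quad
\langle \nabla f(x) - \nabla f(y), x - y \rangle
  &\ge \lambda \|x - y\|^2 .
\end{align*}
```

Setting $y = x^*$ and using $\nabla f(x^*) = 0$ gives the stated bound.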

Now we consider conditions on the convex function $f$ under which Assumption F' is satisfied. In
particular, let $f$ be a differentiable Lipschitz-continuous convex function defined on $\mathbb{R}^d$, and assume
that $f$ is locally strongly convex near $x^*$, meaning that there exist $\lambda > 0$ and $\epsilon > 0$ such that

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\lambda}{2} \|x - y\|^2 \quad \text{for } x, y \text{ s.t. } \|x - x^*\| \le \epsilon, \ \|y - x^*\| \le \epsilon.$$

In particular, as $\nabla f(x^*) = 0$, we have $f(x) \ge f(x^*) + \frac{\lambda}{2} \|x - x^*\|^2$ for all $x$ such that $\|x - x^*\| \le \epsilon$.
We have the following lemma on the growth of such functions.

Lemma 1. Let $f$ satisfy the conditions in the preceding paragraph. Then

$$f(x) \ge f(x^*) + \frac{\lambda}{2} \min\{\|x - x^*\|^2, \epsilon \|x - x^*\|\}. \tag{13}$$

Deferring proof of Lemma 1, we show how it implies that the conditions of Assumption F' are
satisfied with the Lyapunov function $V(x) = \frac{1}{2} \|x\|^2$ and residual operator $R(x) = \nabla f(x)$. Indeed,
by applying the strong convexity inequality with $y = x^*$, we have for $x$ such that $\|x - x^*\| \le \epsilon$ that

$$\langle \nabla V(x - x^*), R(x) \rangle = \langle x - x^*, \nabla f(x) \rangle = \langle \nabla f(x) - \nabla f(x^*), x - x^* \rangle \ge \lambda \|x - x^*\|^2.$$

[Figure 8 panels: (a) RCV1 ($p_{nz} = .003$), (b) Higgs ($p_{nz} = 1$), (c) Forest ($p_{nz} = 1$).]

Figure 8. Exponentially decreasing stepsizes: Logistic regression experiments showing speedup (10)
on the (a) RCV1, (b) Higgs, and (c) Forest Cover datasets with epoch-based stepsize $\alpha_{{\rm epoch}\ k} = .95^k$.


Now we claim that

$$\inf_{x : \|x - x^*\| \ge \epsilon} \langle \nabla V(x - x^*), R(x) \rangle = \inf_{x : \|x - x^*\| \ge \epsilon} \langle x - x^*, \nabla f(x) \rangle > 0. \tag{14}$$

To see this, note that by claim (13), for $x$ such that $\|x - x^*\| \ge \epsilon$, we have for some constant $c > 0$
that $f(x) - f(x^*) \ge c \|x - x^*\|$, while we have $f(x^*) \ge f(x) + \langle \nabla f(x), x^* - x \rangle$, so that

$$\langle \nabla f(x), x - x^* \rangle \ge f(x) - f(x^*) \ge c \|x - x^*\| \ge c \epsilon \quad \text{for all } x \text{ s.t. } \|x - x^*\| \ge \epsilon.$$

Additionally, whenever Assumption C' holds, we have $E[\|\nabla F(x; W)\|^2] \le G^2$, so that all the
conditions of Assumption F' are satisfied. Except for the proof of Lemma 1, this completes the
proof of Theorem 1.

Proof of Lemma 1. Fix $y \in \mathbb{R}^d$ and let $h_y(t) = f(x^* + t y / \|y\|) - f(x^*)$. Notably, $h_y$ is a
one-dimensional convex function, and $h_y(t) \ge (\lambda / 2) t^2$ for $|t| \le \epsilon$. As the slopes of convex functions
are non-decreasing (cf. Hiriart-Urruty and Lemaréchal [?, Chapter I]), we have

$$h_y'(\epsilon) = \lim_{\delta \downarrow 0} \frac{h_y(\epsilon + \delta) - h_y(\epsilon)}{\delta} \ge \frac{h_y(\epsilon) - h_y(0)}{\epsilon} \ge \frac{\lambda \epsilon^2}{2 \epsilon} = \frac{\lambda \epsilon}{2}.$$

This inequality implies that for any $t \ge \epsilon$, we have

$$\frac{f(x^* + t y / \|y\|) - f(x^*)}{t} = \frac{h_y(t) - h_y(0)}{t} \ge \frac{h_y(\epsilon) - h_y(0)}{\epsilon} \ge \frac{\lambda \epsilon}{2},$$

while for $0 \le t \le \epsilon$, we use that $t \mapsto (h_y(t) - h_y(0)) / t$ is non-decreasing in $t$ to obtain

$$\frac{f(x^* + t y / \|y\|) - f(x^*)}{t} = \frac{h_y(t) - h_y(0)}{t} \ge \frac{\lambda t^2}{2 t} = \frac{\lambda t}{2}.$$

Combining the two preceding displays, we have $f(x^* + t y / \|y\|) - f(x^*) \ge \frac{\lambda t}{2} \min\{\epsilon, t\}$, which is
equivalent to inequality (13). □
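As a concrete illustration of the growth bound (13), consider the one-dimensional Huber function, which is convex, differentiable, 1-Lipschitz, and strongly convex with $\lambda = 1$ on $|x| \le 1$, so that $\epsilon = 1$ and $x^* = 0$. This example is chosen here for illustration and does not appear in the original argument.

```python
def huber(x):
    """f(x) = x^2 / 2 for |x| <= 1 and |x| - 1/2 otherwise; minimized at x* = 0."""
    return 0.5 * x * x if abs(x) <= 1.0 else abs(x) - 0.5


def growth_lower_bound(x, lam=1.0, eps=1.0):
    """Right-hand side of (13) with f(x*) = 0: (lam / 2) * min(x^2, eps * |x|)."""
    return 0.5 * lam * min(x * x, eps * abs(x))


grid = [i / 10.0 for i in range(-50, 51)]
assert all(huber(x) >= growth_lower_bound(x) - 1e-12 for x in grid)
```

On $|x| \le 1$ the bound holds with equality, and outside that interval the function grows linearly, matching the two regimes in the minimum.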
4.2 Proof of Theorem 2

Before beginning the proof of the theorem proper, we state a martingale convergence lemma
necessary for our development, then give an outline of the proof to come.

Lemma 2 (Robbins and Siegmund [30]). Let $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots$ be a filtration and $V_n, \beta_n, \kappa_n, \varepsilon_n$ be
non-negative $\mathcal{F}_n$-measurable random variables such that

$$E[V_{n+1} \mid \mathcal{F}_n] \le (1 + \beta_n) V_n + \kappa_n - \varepsilon_n.$$

On the event that $\sum_n \beta_n < \infty$ and $\sum_n \kappa_n < \infty$, we have $V_n \to V$ almost surely for a non-negative random
variable $V$ with $V < \infty$ almost surely, and $\sum_{n=1}^\infty \varepsilon_n < \infty$ a.s.

We prove Theorem [2] by relating the sequence x k from expression @ to a sequence whose 
performance is somewhat easier to analyze, and which has values more closely approximating a 
“correct” stochastic gradient iteration: we define 


$$\tilde{x}_k := -\sum_{i=1}^{k-1} \alpha_i g_i \quad \text{and} \quad \tilde{\Delta}_k := \tilde{x}_k - x^*. \tag{15}$$

With this iteration, we have that $\tilde{x}_k \in \mathcal{F}_{k-1}$, where $\mathcal{F}_k$ is the $\sigma$-field defined in expression (6),
and (we show) it is close enough to the correct iterates $x_k$ to give our desired results. The idea of
analyzing a corrected sequence for distributed iterations builds out of Bertsekas and Tsitsiklis [5,
Chapter 7.8] and has been used (e.g. [?]) for distributed and parallel optimization problems.

Because delays may grow unboundedly under Assumption A, we consider the effects of delay
increasing polynomially with $n$. With this in mind, for the proof and all internal lemmas, we let $\rho$
be a fixed constant satisfying

$$\frac{1}{\tau - 1} < \rho < \beta - \frac{1}{1 + \gamma}, \tag{16}$$

where the stepsizes $\alpha_k = \alpha k^{-\beta}$, $\tau$ is the moment in Assumption A, and $\gamma \in (0, 1]$ is the power in
Assumption D. The interval (16) is assumed to be non-empty by the conditions of the theorem.
Note that this implies that $\rho \in (0, \beta - \frac{1}{2})$. We will show that $n^\rho$ functions as a bound on the
delays in incorporating gradient information.
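Whether an admissible delay exponent exists for given problem constants can be checked directly. The sketch below assumes the interval reads $\frac{1}{\tau - 1} < \rho < \beta - \frac{1}{1 + \gamma}$, which is the reading consistent with the conditions of Lemma 3 and Lemma 14; the function name is hypothetical.

```python
def delay_exponent_interval(tau, beta, gamma):
    """Return the open interval (lower, upper) of admissible rho, or None if it is empty."""
    lower = 1.0 / (tau - 1.0)          # needed for the Borel-Cantelli step in Lemma 3
    upper = beta - 1.0 / (1.0 + gamma)  # keeps the delay penalty below the stepsize decay
    return (lower, upper) if lower < upper else None


print(delay_exponent_interval(tau=9.0, beta=0.75, gamma=1.0))
print(delay_exponent_interval(tau=3.0, beta=0.55, gamma=1.0))
```

Under this reading, heavier-tailed delays (smaller $\tau$) push the lower endpoint up, so the stepsize power $\beta$ must be correspondingly larger for the interval to be non-empty.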


Outline of proof. We provide a brief outline before giving the remainder of the proof. First,
we show that there is (asymptotically) a finite bound such that the delays $M_n$ on asynchronous
updates are at most of order $n^\rho$ by Assumption A (Lemmas 3 and 4). Then we show that the
“corrected” sequence $\tilde{x}_k$ (and $\tilde{\Delta}_k$) converges appropriately (Lemma 5), assuming that the true errors
$\Delta_k$ do not diverge, giving almost sure convergence of $\tilde{\Delta}_k$ using the Robbins-Siegmund martingale
convergence theorem (Lemma 2). We use these results to show that $\Delta_k$, $\tilde{\Delta}_k$, and $\Delta'_k$, where $\Delta'_k$
is defined by the simpler linear matrix iteration $v_{k+1} = (I - \alpha_k H) v_k - \alpha_k \xi_k$ with $\Delta'_k = v_k - x^*$,
are all asymptotically equivalent in probability (Lemmas 6, 7, and 8), as long as the errors $\Delta_k$ are
assumed to stay bounded. In particular, the differences satisfy $\|\Delta_k - \tilde{\Delta}_k\|^2 = O_P(\alpha_k^2 k^{2\rho})$; that is, they
scale quadratically in the stepsize $\alpha_k$ and with some penalty for delays, and the errors tend to
zero as long as $\rho$ is not too large; asynchrony is dominated by the magnitude of observed gradient
noise. Asymptotic normality (Lemma 10) of the equivalent sequences $\Delta_k$, $\tilde{\Delta}_k$, $\Delta'_k$ then follows from
results of Polyak and Juditsky [26], which guarantee a central limit theorem for the sequence $\Delta'_k$,
because the error bounds on $\xi$ of Assumption E guarantee that $\xi_k$ eventually behaves like an i.i.d.
sequence. Lastly (in Lemma 11), we show that our overarching assumption, that the true errors
$\Delta_k$ did not diverge, in fact holds under the assumptions of the theorem.

We now turn to the proof of Theorem 2 proper.

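Lemma 3. Let $\rho > \frac{1}{\tau - 1}$ and $\mathcal{E}_n$ be the event that $E^{nk} \neq I_{d \times d}$ for some $k \le n - n^\rho$. Then

$$P(\mathcal{E}_n \text{ occurs infinitely often}) = 0.$$

Proof. We have that $E^{nk} \neq I$ if and only if $M_k \ge n - k + 1$, so that

$$P(I \neq E^{nk}) \le P(M_k \ge n - k + 1) \le \frac{E[M_k^\tau]}{(n - k + 1)^\tau} \le \frac{M^\tau}{(n - k + 1)^\tau}.$$

Letting $\mathcal{E}_n$ be the event that $I \neq E^{nk}$ for some $k \le n - n^\rho$ as in the statement of the lemma,

$$P(\mathcal{E}_n) = P(M_k \ge n - k + 1 \text{ for some } k \le n - n^\rho) \le \sum_{k=1}^{n - n^\rho} \frac{M^\tau}{(n - k + 1)^\tau} = \sum_{k = n^\rho + 1}^{n} \frac{M^\tau}{k^\tau} \lesssim \int_{n^\rho}^\infty t^{-\tau} \, dt \lesssim (n^\rho)^{1 - \tau} = n^{\rho (1 - \tau)}.$$

Thus we find that

$$\sum_{n=1}^\infty P(\mathcal{E}_n) \lesssim \sum_{n=1}^\infty n^{\rho (1 - \tau)} \stackrel{(i)}{<} \infty,$$

where inequality (i) holds if and only if $\rho (\tau - 1) > 1$, or $\rho > \frac{1}{\tau - 1}$. Applying the Borel-Cantelli
lemma gives the result. □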
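The integral comparison used in the proof of Lemma 3, namely that a tail sum of $k^{-\tau}$ starting above $m$ is at most $\int_m^\infty t^{-\tau} \, dt = m^{1 - \tau} / (\tau - 1)$ for $\tau > 1$, can be checked numerically. This is only an illustrative sketch: the function names are hypothetical, and the infinite tail is truncated after a fixed number of terms.

```python
def tail_sum(m, tau, terms=20000):
    """Truncated tail sum of k**(-tau) over k = m + 1, ..., m + terms."""
    return sum(k ** (-tau) for k in range(m + 1, m + 1 + terms))


def integral_bound(m, tau):
    """Closed form of the integral of t**(-tau) from m to infinity, valid for tau > 1."""
    return m ** (1.0 - tau) / (tau - 1.0)


for m in (10, 100, 1000):
    print(m, tail_sum(m, 3.0), integral_bound(m, 3.0))
```

With $m = n^\rho$ the bound is of order $n^{\rho (1 - \tau)}$, which is summable over $n$ exactly when $\rho (\tau - 1) > 1$.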


As an immediate consequence of this lemma, we obtain the following. 

Lemma 4. Let ak = ak~@, where /3 € [0,1). For any p £ (A^, 1), with probability 1 we have 


limsup sup 

n zeRI!: 


ELi W-e 


1 nk I 


Zk 


ELi <*k\\l-E 


'nk I 


E n 

k=n-nP Z k 


< 1 and sup sup n 

n zeR+ OL n l^k=n-nP Z k 


Zk 


< OO, 


where we treat 0/0 = 1. 

Proof The first statement of the lemma follows from Lemma [3j as with probability 1 over the 
delays Mk and delay matrices E nk , there exists some (random) N such that n > N implies that 
E nk = I for all k < n — n p , so that for n > N, we have for all nonnegative sequences zi, Z 2 , ■ ■ ■ that 

ELip-£ nfc lk / Ek=n-nP Z k , 

E n ■— v-^n • 

k—n—nP k 2-^k=n—n p k 

The second follows from the first once we note that for k G [n — n p , n], we have 

l> — > n~ B (n — n P Y = (1 — n p_1 ) /3 = exp(—/3n p_1 )(l + o(l)) — »• 1 
otk 
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as n —>• oo. The limit supremum is finite, so the supremum must likewise be finite. 


□ 


In particular, Lemma 0] implies that any sequence a n a k \\l ~ E nk [| cannot diverge 
more quickly than a 2 Y2k=n-nP ^k- We will use this fact frequently. For the remainder of the proof 
of Theorem 0J we define the random variable (implicitly depending on the power p chosen in the 
interval (fTtfll ) 


K n := max sup 


I-E 


nk I 


z k 


m < n zgR^ 


a r 


E m 

k=m—m p ^k 


and Kqo := limsup K r 


(17) 


As K n are non-decreasing, we have = lim„, K n = sup n K n , and by Lemma 01 we see that with 
probability 1 over the delay process, we have Koo < oo, and moreover, we have K n G J~ n -\ by 
definition ([6]) of the er-fields I~k- For t G R define 


)Cn,t ■= {K„ < t} and JC 


oo ,£ • — 


Pi K-n,t = \ sup K n<t> 

n {71 ) 


(18) 


be the events that K n and Kqo are bounded by t, respectively, noting that K, U) t G T n - 1 as K n G T n -\ 
as before. Then Lemma 0] implies 


sup K n < t 1 =1. 

n ) 

Our first lemma, whose proof we provide in Sec. IA.11 builds off of the Robbins-Siegmund 
martingale convergence theorem (Lemma 0]) to give an almost sure convergence result for the 
corrected sequence A n . 

Lemma 5. Let Assumptions [D| and\A\hold and the stepsizes ak = ak~^. 

(a) If Assumption^ holds, let t < oo and assume additionally that sup n E[1 {/C nj t} ||A n || 2 ] < oo. 
Then there is a finite random variable Vt such that 1 V (A n ) ^4' V* and 

OO 

Y anl (A n ), R(x n < oo. 

71=1 


lim P(n n >i K n) t) = lim P(/Coo t ) = lim P 

£—>•00 £—> oo t—>oo 


(b) If Assumption [FI holds, there is a finite random variable V such that V(A n ) °4' V, and 

OO 

Y «n ( W (An),R(Xn)) < OO. 

71=1 


We can now verify that A n ^4' 0 under the conditions of Lemma EJ First, let Assumption IF1 
hold, and let e > 0 be the radius for which (VV(x — x*), R(x)) > AoR(x — x*) for ||a: — a:*|| < e. 
With co := inf a ..|| x _ a .*|i >e (W(x — x*),R(x)) > 0, we have 

OO OO 

Y a n min{A 0 R(A n ), cp} < Y a n (vR(A n ), R(x n )^ < oo, 

71=1 71=1 
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and V(A n ) 4 ' V for some random variable V. If P(V > 0) > 0, there exist realizations of the 
randomness in the problem such that V > 0, and for such realizations there must exist eo > 0 such 
that V(A n ) > eo for all sufficiently large n as V{A n ) 4' V ; this contradicts a n = °o, so we 

must have V = 0 a.s. under Assumption IFl 

Under the alternate Assumption [F] and that sup n E[l ||A„|| 2 ] < oo for all t, we have 

oo OO 

^2 1 {fcn,t} a n A 0 U(A n ) < ^2 1 {fcn,t} ct n (vV(A n ),R(x n )^ < oo. 

72—1 72—1 

As Y n = 00 > we must then have V(A n )l 4' 0, as we know it converges to something by 

Lemma 0 Using that lim^oo P(/C cx3) t) = 1 and K, n ^t A H n +\,t for all n, we hnd 

P(U(A n ) t 4 0) = lim P(/Coo t and U(A n ) 0) < limsupP (l V(A n ) 4 o) = 0 

t->°° t— kx> ' ' 

by the preceding discussion. In particular, we have 

U(A n ) “4- 0 (19) 


whenever the conditions of Lemma 0 hold. 

Now we show that the averages of A n and A n are asymptotically equivalent in distribution, and 
we have quantitative control over this equivalence. (See Section fA.21 for a proof of this lemma.) 

Lemma 6. In addition to the conditions of Lemma 0 assume that j3 > | and either (a) 

Assumption^ holds and the sequence C\ nt = maxfc< n E[l {/C/-4 || A^ || 2 ] satisfies sup n CA, n ,t < oo 
for allt € M or (b) Assumption\F^ holds. In case (a), there exists a universal constant C such that 


E 


1 {ICn,t} ||A n -AJ 2 


< Ca\n 2p t 2 {C\ n _i t + 1 ). 


( 20 ) 


Additionally, 


y/n 




^^(Afc — A f.) 4 0, 

k =1 



54 H Afc ~ 

k =1 


0 , 


( 21 ) 


and 

V (A n ) 4 0. (22) 

Thus, any distributional convergence results we are able to show on n~ 2 Yk=i A fc w bl also bold 
for the uncorrected sequence n~ 2 Yk=l A ^ as l° n g as tbe conditions of Lemma 0 hold. With this 
in mind, we give an additional equivalence result showing that A & is equivalent to an easier to 
analyze sequence of errors generated from a simpler matrix iteration. Let the noise sequence {6c} 
be generated as in the iterations dn} (1ml) . and consider the two iterations 

Afc + i = A k~ ak (R(xk) + 6c) and A / fe+1 = (L — atkH)A' k — a^k- (23) 

We have the following two results, which show (under slightly different conditions) that the differ¬ 
ences between the iterations (1231) tends to zero. The proofs of the lemmas are quite similar, so we 
put material relevant for the proof of both in Section IA.31 specializing to each of the lemmas in 
sections IA.3.11 and IA.3.21 respectively. 
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Lemma 7. Let Assumption [P] hold with some 7 € (0,1]. Let the stepsizes atk = ak P, where 
yA + Ay < ft < 1. Let Assumption [7] hold and assume that for each t £ 1 there is a constant 

CA,t < 00 such that E[1 {JC Ut t} || |||] < C\ t for all n. Then 

n 

n 2 — ^fe) ^ 0- 

k =1 

Lemma 8. Lei Assumption [P] hold with some 7 € (0,1]. Lei the stepsizes a k = ak~^, where 
yA + Ay < /3 < 1. Lei Assumption [Fj hold. Then 

n 

n 2 _ ^fe) ^ 0- 

k =1 

With the preceding lemmas in place, we require two additional claims that immediately yield 
our desired convergence guarantee. The first is the asymptotic normality of the matrix product 
sequence A' k , that is, that n~^ Ylk=i ^'k asymptotically normal. The second is that under 
Assumption [Fl we have sup fc E[|| A^.|| 2 1 {KL k ,t\] < oo for all i € R. We begin with the first result. 

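Lemma 9 (Polyak and Juditsky [26], Theorem 1). Let $\{\xi_k\}$ be a martingale difference sequence
adapted to the filtration $\mathcal{F}_k$, so that $E[\xi_k \mid \mathcal{F}_{k-1}] = 0$ and $\sup_k E[\|\xi_k\|^2 \mid \mathcal{F}_{k-1}] < \infty$ with probability
1. Assume additionally that

$$\lim_{c \to \infty} \limsup_{k \to \infty} E\left[\|\xi_k\|^2 \, 1\{\|\xi_k\| > c\} \mid \mathcal{F}_{k-1}\right] = 0 \quad \text{in probability}$$

and that as $k \to \infty$ we have

$$\mathrm{Cov}(\xi_k \mid \mathcal{F}_{k-1}) \to \Sigma \succ 0,$$

where $\mathrm{Cov}(\xi \mid \mathcal{F}) = E[\xi \xi^T \mid \mathcal{F}]$. Then if $H \succ 0$, the iteration

$$\Delta'_{k+1} = (I - \alpha_k H) \Delta'_k - \alpha_k \xi_k$$

with $\alpha_k = \alpha k^{-\beta}$ satisfies

$$\frac{1}{\sqrt{n}} \sum_{k=1}^n \Delta'_k \stackrel{d}{\to} N(0, H^{-1} \Sigma H^{-1}).$$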

We verify that the conditions of Lemma [9] hold in the two settings captured by Theorem 0 
that is, under Assumption [F] and [P] For Assumption [P] we immediately have each condition on 
fk except that Cov(f k | Tk- 1 ) A E for some positive definite matrix E. But we have A k A' 0 as 
in expression (1221) , so that by Assumption |E] 

Cov(& | T k - 1 ) = E[&(0)&(0) T + C(®fc)C(®fc) T | T k - 1 ] = S + o P ( 1) + Op(\\x k - A|| 2 ) A E. 

Thus, the conditions of Lemma [9] hold under Assumption iPl We now argue that the conditions of 
the lemma hold under Assumption iFl and the additional condition that sup fc E[|| Afc|| 2 1 {!C k ,t}\ < 00 
for all t € R. In this case, we still know that E[£*. | T k ~ 1 ] = 0, and by Assumption [El we have 

supE [ll^ll 2 | T k -!] < 2supE [||£fc(0)|| 2 + ||C(x fc )|| 2 | T k - 1 ] 
k k 

< 2supE[||£ fc (0)|| 2 | T k - 1 ] + Psup ||A fc || 2 < 00 , 

k k 
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as || Ajfc|| °4' 0 by the result (1221) . The second condition on the limits of E[||^|| 2 1 {||^|| > c} | J~ k -\\ 
holds similarly, as we have lim^oo P(sup fc || Afc|| > c) = 0 because ||A^|| ^4 0, again using the 
guarantee (1221) if sup fe E[1 {)Ck,t} ||Afc|| 2 ] < oo. The covariance condition follows identically to the 
argument under Assumption fpl Summarizing, we have shown the following result. 

Lemma 10. Let either (i) Assumption F hold and assume that $\sup_k E[\|\Delta_k\|^2 \, 1\{\mathcal{K}_{k,t}\}] < \infty$ for each
$t$ or (ii) Assumption F' hold. Then the sequence $\Delta'_k$ defined by the iteration (23) satisfies

$$\frac{1}{\sqrt{n}} \sum_{k=1}^n \Delta'_k \stackrel{d}{\to} N(0, H^{-1} \Sigma H^{-1}).$$

By Lemmas 0 and [H] and the probabilistic equivalence (1211) of the sequences y/nA n and y/nA n , 
Lemma flOl implies that (under the conditions of the lemma) 

n 

— ^ A fc 4 N (0, H^TH- 1 ) . 

v k =l 

This is the statement of the theorem under AssumptionlF’l the proof of the theorem will be complete 
if we can show that under the strong convexity Assumption [F] we have sup fc E[1 {IC k)t } || A^ || 2 ] < oo 
for all t G R. 

We present a final lemma that gives the boundedness of the sequences E[1 {/C k; t} l|Afc|| 2 ]. 

Lemma 11. Let Assumption [7] hold and e > 0, t € R. There exists some N = N(e,t) € N such 
that n> N implies 

E[||A n+1 || 2 l{/C n+M }] < e 2 maxmax{l,E[l{/C fcj t} ||A fc || 2 ]}. 


Now, Lemma [6] inequality (12(71) . and the fact that the stepsize power /3 > p (recall the inter¬ 
val (1161) ) immediately yields that for any e > 0 and t € R there is some N = N(e,t) € N such that 
n > N implies 


E 


l{£n,t} || A„ - A„|| 2 < e 2 max max {l, E [l {JC kt } ||A fc || 2 ]} . 

J k<n K L J J 


We combine these inequalities and LemmafTIlto argue that under AssumptionlFl we have sup fe C\ k t 
sup fe E[1 {)Ck,t} ||Afc|| 2 ] < oo. Indeed, choose e > 0 ,t G R and N = N(e,t) such that n> N implies 


E 


| A„ - A n || 2 l {JC n j} < — maxmax{l,E [l{JC k ,t} ||A fc || 2 ]} 

J 4 k<.n 


and 


Then 


E 


1 {IC n ,t} ||A n || 2 < — maxmax{l,E [l{JC k ,t} || A fc || 2 ] } . 
J 4 k<n 


E 


1 {JC n , t } || A r 


< 2E 


1 {lCn,t} || A n — A,; 


+ 2E 


1 {JC n ,t} || A r 


< e 2 maxmax{l,E [l{/C n ,t} ||A^|| 2 ] } 

k<n ’ 
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Repeating this chain of inequalities for all k such that N < k < n, we find that 


E 


l{/C n , t }||A ri 


< e 2 max max {l, E [l {ICk,t} ||Afc|| 2 ] } 


for all n > N. As maxfc<jvE[|| A*.|| ] < oo for any finite N, this shows that there exists a constant 
Caj. < oo such that sup fc E[1 {JCk,t} ||Afc|| 2 ] < C\ whenever Assumption [F] holds and the stepsizes 
ctfc are chosen such that a & = ak for fi € (|, 1) and satisfying p < /3 — This completes the 
proof of Theorem [2} 

5 Discussion and conclusions 


In this paper, we have analyzed an asynchronous gradient method, based on Niu et al.'s
Hogwild! [25], for the solution of stochastic convex optimization and variational equality problems.
Our work shows particularly that asynchrony introduces essentially negligible penalty for stochastic
optimization problems under standard optimization assumptions, which can be leveraged in
the development of extremely fast optimization procedures. Our experimental results in Section 3
show that there is still work to be done in terms of a deep understanding of implementation of
these methods. In particular, even without inherent competition for locks or other synchronization
resources in the computer, there can be competition for other resources, such as memory access. As
Table 1 demonstrates, even moderately careful control of memory accesses can be extremely
beneficial, and without it, asynchronous methods do not enjoy the performance benefits made possible
by multi-core and multi-processor systems. It will thus be beneficial, in future work we hope to
undertake, to develop an understanding of memory access and use similar to that now well-known
in the scientific computing literature (see, for example, Ballard et al. [4]). This understanding will
greatly improve the practical effectiveness of stochastic and asynchronous methods.


A Technical proofs for Theorem 2

In this appendix, we collect the technical proofs required for Theorem 2. We also state a few
additional technical lemmas.

Lemma 12. Define CA,n,t '■= max^< ri IE[1 ||Afc|| 2 ]. If Assumption{I]holds, there is a constant 

c < oo independent of n and CA,n,t such that for any l e R +; 


< c-t l n p (C\ n _ ljt + 1). 


77,-1 

E K^l {/C n ,t} E + 

k=n—nP 

If Assumption [FI holds, there is a constant c < oo independent of n such that 

77,-1 

E ll*(z*) + &ll : 

_k=n—n p 


E 


< cn p 


Proof We begin with the result under Assumption |Fl where we use E[||Afc|| 2 l {/C ni t}] < C\ n-11 
for k < n — 1. We also know that 


\\R(x k )\\ 2 < ||A fe || 2 and E[||£ fc || 2 | F k -i\ < ||A fc || 2 + 1, 
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where we have used Assumption [El and thus we have E[E[1 {/C^} ||^/c|| 2 | Fk-i]} < c{C\ n _ 11 + 1). 
In addition, we have ||i?(x)|| = ||i2(x) — i?(x*)|| < L \\x — x*|| and K^l {tCn,t} < t l 1 {/C n> t}, so 



72—1 

72 — 1 


E 

K„l{/C n , t } £ P(x fc ) + 4H 2 

< 2t l E 

1 {/C n ,i} I|i?(x fc )|| 2 + 1{/C n ,t} II4II 2 


k=n—n p 

k—n—n p 




< 1Ct l n p ( LCl n _ u + c(C-i >n _ lit + 1)) , 


where we have used that 1 A 1 for all n. Under Assumption IF1 the result is sim¬ 
pler: we have simply that E[||^|| 2 | T k - 1 ] < C and ||i?(xfe)|| < C , giving the result. □ 


We also give two technical results involving integral convergence. 


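Lemma 13. Let $c > 0$ and $\kappa \in (0,1)$ be constants and $b > a \ge 0$. Then

$$\int_a^b \exp\big(-c\,(t^\kappa - a^\kappa)\big)\, dt \le \frac{\max\{2^{\frac{1-2\kappa}{\kappa}}, 1\}}{\kappa c} \Big[ c^{-\frac{1-\kappa}{\kappa}}\, \Gamma\Big(\frac{1}{\kappa}\Big) + a^{1-\kappa} \Big].$$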


See Appendix A.5 for a proof of Lemma 13. The final technical result we use also gives a bound
on (essentially) another gamma integral.

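Lemma 14. Let $\beta \in (\frac{1}{2}, 1)$ and $\rho < \beta - \frac{1}{2}$. Then for any fixed constant $c > 0$,

$$\lim_{n \to \infty} \sum_{k=1}^n k^{\rho - 2\beta} \exp\big(-c\,(n^{1-\beta} - k^{1-\beta})\big) = 0.$$

See Appendix A.6 for a proof of Lemma 14.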


A.1 Proof of Lemma [5]

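We begin by using the Lipschitz continuity of the gradients of $V$ to note that

$$\begin{aligned}
V(\tilde\Delta_{n+1}) = V(\tilde\Delta_n - \alpha_n g_n)
&\le V(\tilde\Delta_n) - \alpha_n \langle \nabla V(\tilde\Delta_n), g_n \rangle + \frac{L \alpha_n^2}{2} \|g_n\|^2 \\
&= V(\tilde\Delta_n) - \alpha_n \langle \nabla V(\tilde\Delta_n), R(\tilde x_n) \rangle - \alpha_n \langle \nabla V(\tilde\Delta_n), R(x_n) - R(\tilde x_n) \rangle \\
&\qquad - \alpha_n \langle \nabla V(\tilde\Delta_n), \xi_n \rangle + \frac{L \alpha_n^2}{2} \|g_n\|^2.
\end{aligned}$$

Taking expectations conditional on $\mathcal{F}_{n-1}$, we have $x_n, \tilde x_n \in \mathcal{F}_{n-1}$, and $\mathbb{E}[\xi_n \mid \mathcal{F}_{n-1}] = 0$. Moreover,
we have $g_n = R(x_n) + \xi_n$, and thus

$$\begin{aligned}
\mathbb{E}[V(\tilde\Delta_{n+1}) \mid \mathcal{F}_{n-1}] &\le V(\tilde\Delta_n) - \alpha_n \langle \nabla V(\tilde\Delta_n), R(\tilde x_n) \rangle + \frac{L \alpha_n^2}{2}\, \mathbb{E}[\|R(x_n) + \xi_n\|^2 \mid \mathcal{F}_{n-1}] \\
&\qquad + \alpha_n \|\nabla V(\tilde\Delta_n)\| \|R(x_n) - R(\tilde x_n)\|.
\end{aligned}$$

Using that $\|R(x_n) + \xi_n\|^2 \le 2 \|R(x_n)\|^2 + 2 \|\xi_n\|^2$ and

$$\|R(x_n) - R(\tilde x_n)\| \le L \|x_n - \tilde x_n\| = L \Big\| \sum_{k=1}^{n-1} \alpha_k (I - E^{nk}) g_k \Big\| \le L \sum_{k=1}^{n-1} \alpha_k \|I - E^{nk}\| \|g_k\|,$$

we find that there is a constant $C$ such that

$$\begin{aligned}
\mathbb{E}[V(\tilde\Delta_{n+1}) \mid \mathcal{F}_{n-1}] &\le V(\tilde\Delta_n) - \alpha_n \langle \nabla V(\tilde\Delta_n), R(\tilde x_n) \rangle + C \alpha_n^2 \|R(x_n)\|^2 + C \alpha_n^2\, \mathbb{E}[\|\xi_n\|^2 \mid \mathcal{F}_{n-1}] \\
&\qquad + C \alpha_n \|\nabla V(\tilde\Delta_n)\| \sum_{k=1}^{n-1} \alpha_k \|I - E^{nk}\| \|g_k\|.
\end{aligned}$$

Recalling the definition (17) of the random variables $K_n$ and the selection (16) of the power $\rho$, we
have $\sum_{k=1}^{n-1} \alpha_k \|I - E^{nk}\| \|g_k\| \le K_n \alpha_n \sum_{k=n-n^\rho}^{n-1} \|g_k\|$, and using that $g_k = R(x_k) + \xi_k$, we obtain

$$\begin{aligned}
\mathbb{E}[V(\tilde\Delta_{n+1}) \mid \mathcal{F}_{n-1}] &\le V(\tilde\Delta_n) - \alpha_n \langle \nabla V(\tilde\Delta_n), R(\tilde x_n) \rangle + C \alpha_n^2 \|R(x_n)\|^2 + C \alpha_n^2\, \mathbb{E}[\|\xi_n\|^2 \mid \mathcal{F}_{n-1}] \\
&\qquad + C \alpha_n^2 \|\nabla V(\tilde\Delta_n)\| K_n \sum_{k=n-n^\rho}^{n-1} \|R(x_k) + \xi_k\| \qquad\qquad (25a) \\
&\le V(\tilde\Delta_n) - \alpha_n \langle \nabla V(\tilde\Delta_n), R(\tilde x_n) \rangle + C \alpha_n^2 \|\Delta_n\|^2 + C \alpha_n^2\, \mathbb{E}[\|\xi_n\|^2 \mid \mathcal{F}_{n-1}] \\
&\qquad + C \alpha_n^2 n^\rho V(\tilde\Delta_n) + C \alpha_n^2 K_n^2 \sum_{k=n-n^\rho}^{n-1} \|R(x_k) + \xi_k\|^2, \qquad\qquad (25b)
\end{aligned}$$

the final inequality following because $\|R(x)\| = \|R(x) - R(x^*)\| \le L \|x - x^*\|$, $\|\nabla V(\Delta)\| \le L \|\Delta\| \le (L/\sqrt{\lambda}) \sqrt{V(\Delta)}$,
the Cauchy-Schwarz bound $(\sum_{k=n-n^\rho}^{n-1} \|g_k\|)^2 \le n^\rho \sum_{k=n-n^\rho}^{n-1} \|g_k\|^2$, and $ab \le \frac{1}{2} a^2 + \frac{1}{2} b^2$ for any $a, b \in \mathbb{R}$.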

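We now use the technical Lemma 12, which allows us to control the error terms in inequality (25b).
Indeed, by Lemma 12 and our assumption that $C_{\Delta,t} = \sup_k \mathbb{E}[1\{\mathcal{K}_{k,t}\} \|\Delta_k\|^2] < \infty$ if
Assumption F holds, we obtain

$$\sum_{n=1}^\infty \alpha_n^2\, \mathbb{E}\Big[K_n^2 1\{\mathcal{K}_{n,t}\} \sum_{k=n-n^\rho}^{n-1} \|R(x_k) + \xi_k\|^2\Big] \le C t^2 (C_{\Delta,t} + 1) \sum_{n=1}^\infty \alpha_n^2 n^\rho \asymp t^2 (C_{\Delta,t} + 1) \int_1^\infty u^{-2\beta + \rho}\, du < \infty, \qquad (26)$$

the final inequality holding when $2\beta - \rho > 1$, or $\rho < 2\beta - 1$. In particular, the Robbins-Siegmund
convergence theorem (Lemma 2) applies, as we can write (recall inequality (25a) and that $1\{\mathcal{K}_{n,t}\} \le 1\{\mathcal{K}_{n-1,t}\}$)

$$\mathbb{E}[1\{\mathcal{K}_{n,t}\} V(\tilde\Delta_{n+1}) \mid \mathcal{F}_{n-1}] \le (1 + \beta_{n-1})\, 1\{\mathcal{K}_{n-1,t}\} V(\tilde\Delta_n) + \kappa_{n-1} - \zeta_{n-1},$$

where $\beta_{n-1} = C \alpha_n^2 n^\rho$, $\zeta_{n-1} = 1\{\mathcal{K}_{n,t}\}\, \alpha_n \langle \nabla V(\tilde\Delta_n), R(\tilde x_n) \rangle$, and

$$\kappa_{n-1} = 1\{\mathcal{K}_{n,t}\} \Big[ C \alpha_n^2 \|R(x_n)\|^2 + C \alpha_n^2\, \mathbb{E}[\|\xi_n\|^2 \mid \mathcal{F}_{n-1}] + C \alpha_n^2 K_n^2 \sum_{k=n-n^\rho}^{n-1} \|R(x_k) + \xi_k\|^2 \Big]$$

are all $\mathcal{F}_{n-1}$-measurable. Moreover, $\sum_n \beta_n \lesssim \sum_{n=1}^\infty n^{\rho - 2\beta} < \infty$ because $\rho < 2\beta - 1$, and $\sum_n \mathbb{E}[\kappa_n] < \infty$
by the fact that $\mathbb{E}[\mathbb{E}[\|\xi_n\|^2 \mid \mathcal{F}_{n-1}]] \lesssim \mathbb{E}[\|\Delta_n\|^2 + 1]$ (this is Assumption E) coupled with Lemma 12
and inequality (26). We thus conclude that

$$1\{\mathcal{K}_{n,t}\} V(\tilde\Delta_n) \stackrel{a.s.}{\to} V_t \quad \text{and} \quad \sum_{n=1}^\infty \alpha_n 1\{\mathcal{K}_{n,t}\} \langle \nabla V(\tilde\Delta_n), R(\tilde x_n) \rangle < \infty$$

with probability 1 whenever Assumption F holds in addition to the assumptions of the lemma.

In the somewhat simpler case that Assumption F' holds, we may simply remove all indicator
functions $1\{\mathcal{K}_{n,t}\}$, as Lemma 12 shows that we may replace inequality (26) with

$$\sum_{n=1}^\infty \alpha_n^2 \sum_{k=n-n^\rho}^{n-1} \mathbb{E}[\|R(x_k) + \xi_k\|^2] \le C \sum_{n=1}^\infty \alpha_n^2 n^\rho \asymp \int_1^\infty u^{-2\beta + \rho}\, du < \infty,$$

while $\sup_n K_n < \infty$ with probability 1.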
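The summability condition behind inequality (26) is easy to check numerically. The following is a minimal sketch rather than part of the argument: it assumes stepsizes $\alpha_n = n^{-\beta}$ (that is, $\alpha = 1$), and the function names and parameter values are illustrative only. It compares partial sums of $\alpha_n^2 n^\rho$ with the integral bound $1 + \int_1^\infty u^{\rho - 2\beta}\, du$, which is finite exactly when $\rho < 2\beta - 1$.

```python
def partial_sum(beta: float, rho: float, n_terms: int) -> float:
    """Partial sum of alpha_n^2 * n^rho with alpha_n = n^(-beta) (alpha = 1 assumed)."""
    return sum(n ** (rho - 2.0 * beta) for n in range(1, n_terms + 1))


def integral_bound(beta: float, rho: float) -> float:
    """Upper bound 1 + int_1^inf u^(rho - 2*beta) du; infinite unless rho < 2*beta - 1."""
    exponent = 2.0 * beta - rho  # the series converges only if exponent > 1
    if exponent <= 1.0:
        return float("inf")
    return 1.0 + 1.0 / (exponent - 1.0)


if __name__ == "__main__":
    beta, rho = 0.8, 0.3  # illustrative values satisfying rho < 2 * beta - 1
    for n_terms in (10**2, 10**3, 10**4):
        print(n_terms, partial_sum(beta, rho, n_terms), integral_bound(beta, rho))
```

The partial sums increase with the number of terms but stay below the integral bound, while a choice such as $\beta = 0.8$, $\rho = 0.7$ violates $\rho < 2\beta - 1$ and gives an infinite bound.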


A.2 Proof of Lemma [6] 

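We may write the difference

$$\tilde\Delta_n - \Delta_n = \tilde x_n - x_n = \sum_{k=1}^{n-1} \alpha_k (E^{nk} - I) g_k.$$

Recalling the definition (17) of $K_n$ and our choice (16) of $\rho$, this representation guarantees that

$$\|\Delta_n - \tilde\Delta_n\| \le K_n \alpha_n \sum_{k=n-n^\rho}^{n-1} \|g_k\|, \qquad (27)$$

and so we have

$$\frac{1}{\sqrt{n}} \sum_{k=1}^n \|\Delta_k - \tilde\Delta_k\| \le \frac{K_n}{\sqrt{n}} \sum_{k=1}^n \alpha_k \sum_{i=k-k^\rho}^{k-1} \|g_i\|. \qquad (28)$$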


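Let us first show that the quantity (28) is well-behaved in the simpler case of Assumption F'.
Indeed, we have

$$\frac{1}{\sqrt{n}}\, \mathbb{E}\Big[\sum_{k=1}^n \alpha_k \sum_{i=k-k^\rho}^{k-1} \|g_i\|\Big] \lesssim \frac{1}{\sqrt{n}} \sum_{k=1}^n \alpha_k k^\rho \lesssim \frac{1}{\sqrt{n}} \int_1^n u^{\rho - \beta}\, du \asymp n^{\frac{1}{2} + \rho - \beta},$$

which tends to zero if and only if $\beta > \rho + \frac{1}{2}$. Then inequality (28) implies $n^{-\frac{1}{2}} \sum_{k=1}^n \|\Delta_k - \tilde\Delta_k\| \le K_n Z_n$,
where $Z_n \stackrel{p}{\to} 0$ and $\sup_n K_n < \infty$ with probability 1 by Lemma 4; thus, the convergence (21)
holds under Assumption F'.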
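The threshold $\beta > \rho + \frac{1}{2}$ in the preceding display can be seen numerically. The sketch below is illustrative only: it assumes $\alpha_k = k^{-\beta}$, and the function name and parameter values are hypothetical. It evaluates $n^{-1/2} \sum_{k=1}^n \alpha_k k^\rho$ on both sides of the threshold.

```python
def normalized_error_sum(beta: float, rho: float, n: int) -> float:
    """Return n^(-1/2) * sum_{k=1}^n alpha_k * k^rho with alpha_k = k^(-beta)."""
    total = sum(k ** (rho - beta) for k in range(1, n + 1))
    return total / n ** 0.5


if __name__ == "__main__":
    for beta, rho in ((0.8, 0.2), (0.8, 0.4)):  # beta - rho = 0.6 and 0.4 respectively
        values = [normalized_error_sum(beta, rho, n) for n in (10**3, 10**4, 10**5)]
        print(f"beta={beta}, rho={rho}:", values)
```

With $\beta - \rho = 0.6 > \frac{1}{2}$ the normalized sum shrinks as $n$ grows, in line with the rate $n^{\frac{1}{2} + \rho - \beta}$, while with $\beta - \rho = 0.4$ it grows.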

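We turn to the somewhat more challenging case that Assumption F holds and that for our
choice of bound $t$ on $K_n$, there exist constants $C_{\Delta,n,t} = \max_{k \le n} \mathbb{E}[1\{\mathcal{K}_{k,t}\} \|\Delta_k\|^2]$ such that
$C_{\Delta,t} = \sup_n C_{\Delta,n,t} < \infty$. In this case, inequality (28) and the definition (18) of the event
$\mathcal{K}_{n,t} = \{K_n \le t\}$ imply

$$\mathbb{E}\Big[1\{\mathcal{K}_{n,t}\} \frac{1}{\sqrt{n}} \sum_{k=1}^n \|\Delta_k - \tilde\Delta_k\|\Big] \le \frac{t}{\sqrt{n}} \sum_{k=1}^n \alpha_k\, \mathbb{E}\Big[1\{\mathcal{K}_{n,t}\} \sum_{i=k-k^\rho}^{k-1} \|g_i\|\Big].$$

Now, we note that for $k < n$, we have $\mathcal{K}_{k,t} \supset \mathcal{K}_{n,t}$ so that $1\{\mathcal{K}_{n,t}\} \le 1\{\mathcal{K}_{k,t}\}$, and

$$\begin{aligned}
\mathbb{E}[\|g_k\|^2 1\{\mathcal{K}_{n,t}\}] \le \mathbb{E}[\|g_k\|^2 1\{\mathcal{K}_{k,t}\}] &\le 2\, \mathbb{E}[\|R(x_k)\|^2 1\{\mathcal{K}_{k,t}\}] + 2\, \mathbb{E}[\|\xi_k\|^2 1\{\mathcal{K}_{k,t}\}] \\
&\lesssim C_{\Delta,n-1,t} + C_{\Delta,n-1,t} + 1 \qquad\qquad (29)
\end{aligned}$$

by Assumption E and that $\mathcal{K}_{k,t} \in \mathcal{F}_{k-1}$. Thus we have $\mathbb{E}[\|g_k\|^2 1\{\mathcal{K}_{n,t}\}] \lesssim C_{\Delta,n-1,t} + 1$, and by
Jensen's inequality and inequality (27), we have

$$\mathbb{E}\big[1\{\mathcal{K}_{n,t}\} \|\Delta_n - \tilde\Delta_n\|^2\big] \le t^2 \alpha_n^2 n^\rho \sum_{k=n-n^\rho}^{n-1} \mathbb{E}\big[1\{\mathcal{K}_{n,t}\} \|g_k\|^2\big] \le C t^2 \alpha_n^2 n^{2\rho} (C_{\Delta,n-1,t} + 1),$$

where $C$ is some universal constant, by the bound (29). This gives statement (20) of the lemma.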
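To obtain the convergence guarantee (21) in the case of Assumption F and that $\sup_n C_{\Delta,n,t} < \infty$,
note that

$$\sum_{k=1}^n \alpha_k\, \mathbb{E}\Big[1\{\mathcal{K}_{n,t}\} \sum_{i=k-k^\rho}^{k-1} \|g_i\|\Big] \lesssim \sum_{k=1}^n \alpha_k k^\rho (C_{\Delta,k-1,t} + 1) \lesssim (C_{\Delta,n-1,t} + 1) \int_1^n u^{\rho - \beta}\, du \asymp (C_{\Delta,n-1,t} + 1)\, n^{1 + \rho - \beta}.$$

In particular, we have

$$\mathbb{E}\Big[1\{\mathcal{K}_{n,t}\} \frac{1}{\sqrt{n}} \sum_{k=1}^n \|\Delta_k - \tilde\Delta_k\|\Big] \lesssim (C_{\Delta,n-1,t} + 1)\, t\, n^{\frac{1}{2} + \rho - \beta},$$

which tends to 0 if $\beta > \rho + \frac{1}{2}$. Thus, we have shown that for any $\epsilon > 0$, we have for any $t > 0$ that

$$\lim_{n \to \infty} \mathbb{P}\Big(\mathcal{K}_{n,t} \text{ and } n^{-\frac{1}{2}} \sum_{k=1}^n \|\Delta_k - \tilde\Delta_k\| > \epsilon\Big) = 0. \qquad (30)$$

We now use expression (30) to get the desired convergence result in the lemma. Let
$D_n = n^{-\frac{1}{2}} \sum_{k=1}^n \|\Delta_k - \tilde\Delta_k\|$ be shorthand for our error sum. Fix $\delta > 0$ and let $t$ be large enough that
$\mathbb{P}(\mathcal{K}_{\infty,t}) \ge 1 - \delta$, which we know is possible by Lemma 4. Then as $\mathcal{K}_{n,t} \supset \mathcal{K}_{\infty,t}$ by definition, we
have

$$\mathbb{P}(D_n > \epsilon) \le \mathbb{P}(\mathcal{K}_{\infty,t} \text{ and } D_n > \epsilon) + \mathbb{P}(\mathcal{K}_{\infty,t}^c) \le \mathbb{P}(\mathcal{K}_{n,t} \text{ and } D_n > \epsilon) + \delta.$$

Taking the limit as $n \to \infty$, we find from expression (30) that

$$\limsup_{n \to \infty} \mathbb{P}\Big(\frac{1}{\sqrt{n}} \sum_{k=1}^n \|\Delta_k - \tilde\Delta_k\| > \epsilon\Big) \le \delta,$$

and as $\delta > 0$ was arbitrary, we have the desired convergence guarantee (21).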

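Lastly, we show that expression (22) holds under the conditions of the lemma. We use
inequality (27) and the Borel-Cantelli lemma for this. Under Assumption F and the additional condition
that $C_{\Delta,t} = \sup_n \mathbb{E}[1\{\mathcal{K}_{n,t}\} \|\Delta_n\|^2] < \infty$ for all $t$, or under Assumption F', we have

$$\begin{aligned}
\mathbb{P}\big(\mathcal{K}_{n,t} \text{ and } \|\Delta_n - \tilde\Delta_n\| > \epsilon\big)
&\le \epsilon^{-2}\, \mathbb{E}\Big[K_n^2 \alpha_n^2 \Big(\sum_{k=n-n^\rho}^{n-1} \|g_k\|\Big)^2 1\{\mathcal{K}_{n,t}\}\Big] \\
&\stackrel{(i)}{\le} \epsilon^{-2} t^2 \alpha_n^2 n^\rho \sum_{k=n-n^\rho}^{n-1} \mathbb{E}\big[\|g_k\|^2 1\{\mathcal{K}_{n,t}\}\big] \stackrel{(ii)}{\lesssim} n^{2\rho - 2\beta}.
\end{aligned}$$

Here inequality (i) follows from Jensen's inequality and the fact that $K_n 1\{\mathcal{K}_{n,t}\} \le t$, and inequality
(ii) follows either by the bound (29) (when Assumption F holds) or because $\mathbb{E}[\|g_k\|^2] \lesssim 1$ for all $k$
(when Assumption F' holds). In particular, we have

$$\sum_{n=1}^\infty \mathbb{P}\big(\mathcal{K}_{n,t} \text{ and } \|\Delta_n - \tilde\Delta_n\| > \epsilon\big) \lesssim \sum_{n=1}^\infty n^{2\rho - 2\beta} < \infty$$

whenever $\beta > \rho + \frac{1}{2}$, which we have already assumed. Thus, we find that under the assumptions of
the lemma, we have $1\{\mathcal{K}_{n,t}\} \|\Delta_n - \tilde\Delta_n\| > \epsilon$ only finitely many times, for any $t \in \mathbb{R}$. By Lemma 4,
with probability 1 there is some $t < \infty$ such that $\sup_n K_n \le t$; that is, as $\mathcal{K}_{\infty,t} \subset \mathcal{K}_{n,t}$, it must be the case
that $\mathcal{K}_{n,t}$ always holds. So we find that $\|\Delta_n - \tilde\Delta_n\| > \epsilon$ only finitely many times with probability
1. That is, $\|\Delta_n - \tilde\Delta_n\| \stackrel{a.s.}{\to} 0$, and the continuity of $V$ gives the almost sure convergence (22) as
desired, as we know that $\tilde\Delta_n \stackrel{a.s.}{\to} 0$.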


A.3 Proof of Lemmas [7] and [8] 

If we define $B_k^n = \prod_{i=k}^n (I - \alpha_i H)$, we have that $\Delta'_{n+1} = B_1^n \Delta_1 - \sum_{k=1}^n \alpha_k B_{k+1}^n \xi_k$, and additionally
we have

$$\sum_{k=1}^n \Delta'_k = \sum_{k=1}^n B_1^{k-1} \Delta_1 - \sum_{k=1}^n H^{-1} \xi_k + \sum_{k=1}^n W_k^n \xi_k,$$

where the matrix $W_k^n$ is defined by $W_k^n = \alpha_k \sum_{i=k}^n B_{k+1}^i - H^{-1}$. This matrix is well-structured,
as the following lemma shows.

Lemma 15 (Polyak and Juditsky [26], Lemma 1). Let $\beta \in (0,1)$. Then

$$\sup_{k,n} \|W_k^n\| < \infty \quad \text{and} \quad \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^n \|W_k^n\| = 0.$$

Thus, as we show rigorously shortly, the behavior of $\sum_{k=1}^n \Delta'_k$ is governed almost completely by
$\sum_{k=1}^n H^{-1} \xi_k$.
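To build intuition for Lemma 15, one can examine the scalar case numerically. The sketch below is an illustration under stated assumptions, not the argument of Polyak and Juditsky: it takes $H = h > 0$ a scalar, stepsizes $\alpha_k = \alpha k^{-\beta}$, and $W_k^n = \alpha_k \sum_{i=k}^n B_{k+1}^i - h^{-1}$ with the empty product $B_{k+1}^k = 1$. The function name and parameter values are hypothetical. The tail sums $S_k = \sum_{i=k}^n B_{k+1}^i$ are built backwards through the recursion $S_k = 1 + (1 - \alpha_{k+1} h) S_{k+1}$ with $S_n = 1$.

```python
def scalar_weight_errors(h: float, alpha: float, beta: float, n: int):
    """Return (max_k |W_k^n|, (1/n) * sum_k |W_k^n|) in the scalar case H = h."""
    # step[k] = alpha * k^(-beta) for k = 1..n; index 0 is unused padding.
    step = [0.0] + [alpha * k ** (-beta) for k in range(1, n + 1)]
    tail_sum = 1.0  # S_n: only the empty product contributes
    abs_errors = [abs(step[n] * tail_sum - 1.0 / h)]
    for k in range(n - 1, 0, -1):
        # S_k = 1 + (1 - alpha_{k+1} * h) * S_{k+1}
        tail_sum = 1.0 + (1.0 - step[k + 1] * h) * tail_sum
        abs_errors.append(abs(step[k] * tail_sum - 1.0 / h))
    return max(abs_errors), sum(abs_errors) / n


if __name__ == "__main__":
    for n in (10**3, 10**4, 10**5):
        print(n, scalar_weight_errors(h=1.0, alpha=1.0, beta=0.75, n=n))
```

In this scalar setting the maximum stays bounded while the average of $|W_k^n|$ decreases as $n$ grows, which is the behavior the lemma asserts.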

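Now, by the iteration (23), we have that

$$\tilde\Delta_{k+1} = \tilde\Delta_k - \alpha_k [R(x_k) + \xi_k] = (I - \alpha_k H) \tilde\Delta_k + \alpha_k \underbrace{(H \tilde\Delta_k - R(x_k))}_{=:\, Z_k} - \alpha_k \xi_k,$$

so that by analogy with $\Delta'_k$ we have

$$\sum_{k=1}^n \tilde\Delta_k = \sum_{k=1}^n B_1^{k-1} \Delta_1 - \sum_{k=1}^n H^{-1} \xi_k + \sum_{k=1}^n W_k^n \xi_k + \sum_{k=1}^n H^{-1} Z_k - \sum_{k=1}^n W_k^n Z_k.$$

Using the iteration (23) for $\Delta'_k$, we thus have

$$\sum_{k=1}^n (\tilde\Delta_k - \Delta'_k) = \sum_{k=1}^n (H^{-1} - W_k^n) Z_k = \sum_{k=1}^n (H^{-1} - W_k^n)(H \tilde\Delta_k - R(x_k)). \qquad (31)$$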
k= 1 k= 1 i =1 
Thus, to show that $\frac{1}{\sqrt{n}} \sum_{k=1}^n (\tilde\Delta_k - \Delta'_k) \stackrel{p}{\to} 0$, it suffices to show that the rightmost sum in
expression (31) is $o_P(\sqrt{n})$.

By Lemma 15, we know that $\sup_{k,n} \|H^{-1} - W_k^n\| < \infty$, and thus

$$\Big\| \sum_{k=1}^n (\tilde\Delta_k - \Delta'_k) \Big\| \lesssim \sum_{k=1}^n \|H \tilde\Delta_k - R(x_k)\| \le \sum_{k=1}^n \big( \|H \tilde\Delta_k - R(\tilde x_k)\| + \|R(\tilde x_k) - R(x_k)\| \big). \qquad (32)$$

We consider each of the right-hand terms in inequality (32) in turn, beginning with the second. In
this case, Lemma 6 implies that under the conditions of either Lemma 7 or 8 we have

$$\frac{1}{\sqrt{n}} \sum_{k=1}^n \|R(\tilde x_k) - R(x_k)\| \le \frac{L}{\sqrt{n}} \sum_{k=1}^n \|\tilde x_k - x_k\| = \frac{L}{\sqrt{n}} \sum_{k=1}^n \|\tilde\Delta_k - \Delta_k\| \stackrel{p}{\to} 0. \qquad (33)$$

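We now turn to the error part $H \tilde\Delta_k - R(\tilde x_k)$ of inequality (32). To that end, let $\epsilon > 0$ be the value
in Assumption D such that $\|H(x - x^*) - R(x)\| \le C \|x - x^*\|^{1+\gamma}$ for $x$ such that $\|x - x^*\| \le \epsilon$.
Splitting the sum into two parts, Assumption D thus implies

$$\sum_{k=1}^n \|H \tilde\Delta_k - R(\tilde x_k)\| \le \sum_{k=1}^n \|H \tilde\Delta_k - R(\tilde x_k)\|\, 1\{\|\tilde\Delta_k\| > \epsilon\} + C \sum_{k=1}^n \|\tilde\Delta_k\|^{1+\gamma}.$$

As we know that $\tilde\Delta_k \stackrel{a.s.}{\to} 0$, with probability 1 over $\{\xi_1, \xi_2, \ldots\} \cup \{E^{ij}\}_{i > j}$ (recall the convergence
guarantee (19) after Lemma 5) there exists some (random) $N(\epsilon) < \infty$ such that $\|\tilde\Delta_k\| \le \epsilon$ for all
$k \ge N(\epsilon)$, so that

$$\lim_{n \to \infty} \sum_{k=1}^n \|H \tilde\Delta_k - R(\tilde x_k)\|\, 1\{\|\tilde\Delta_k\| > \epsilon\} < \infty \ \text{ w.p. 1}, \quad \text{and} \quad n^{-\frac{1}{2}} \sum_{k=1}^n \|H \tilde\Delta_k - R(\tilde x_k)\|\, 1\{\|\tilde\Delta_k\| > \epsilon\} \stackrel{a.s.}{\to} 0.$$

It thus remains to argue that $n^{-\frac{1}{2}} \sum_{k=1}^n \|\tilde\Delta_k\|^{1+\gamma} \to 0$ in probability (or otherwise).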

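Let $\epsilon > 0$ be such that $\langle \nabla V(x - x^*), R(x) \rangle \ge \lambda_0 V(x - x^*)$ for all $x$ such that $\|x - x^*\| \le \epsilon$ (such
an $\epsilon$ certainly exists under both Assumptions F and F'). Define the events

$$\mathcal{E}_l^k = \{\|\tilde\Delta_i\| \le \epsilon, \text{ all } i \in \{l, \ldots, k\}\}.$$

Dividing the sum into two parts, we have

$$\sum_{k=1}^n \|\tilde\Delta_k\|^{1+\gamma} \le \sum_{k=1}^n \|\tilde\Delta_k\|^{1+\gamma}\, 1\{\mathcal{E}_{\lceil k/2 \rceil}^{k-1}\} + \sum_{k=1}^n \|\tilde\Delta_k\|^{1+\gamma} \big(1 - 1\{\mathcal{E}_{\lceil k/2 \rceil}^{k-1}\}\big).$$

By the fact that $\tilde\Delta_k \stackrel{a.s.}{\to} 0$, we know that there exists some (random but finite) $N(\epsilon)$ such that
$\|\tilde\Delta_k\| \le \epsilon$ for all $k \ge N(\epsilon)$; the second term in the preceding display is thus finite with probability
one and

$$n^{-\frac{1}{2}} \sum_{k=1}^n \|\tilde\Delta_k\|^{1+\gamma} \big(1 - 1\{\mathcal{E}_{\lceil k/2 \rceil}^{k-1}\}\big) \stackrel{a.s.}{\to} 0.$$

By combining expressions (31), (32), and (33) with the above display, we see that to prove Lemma 7
or 8, all that remains to show is that

$$n^{-\frac{1}{2}} \sum_{k=1}^n \|\tilde\Delta_k\|^{1+\gamma}\, 1\{\mathcal{E}_{\lceil k/2 \rceil}^{k-1}\} \stackrel{p}{\to} 0. \qquad (34)$$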


A.3.1 Proof of Lemma [7]


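We give a single-step bound on $V(\tilde\Delta_k)$ that we can use to give the desired convergence guarantee
under the conditions of Lemma 7. Recall the definitions (17) and (18) of $K_n$ and the associated
event $\mathcal{K}_{n,t} = \{K_n \le t\}$, and recall also our assumption (16) on $\rho$, which in particular gives
$\rho < \beta - \frac{1}{1+\gamma}$, where $\rho$ is the power used in the definition of $K_n$. We claim that there exist constants
$c > 0$ and $C < \infty$, independent of $C_{\Delta,t} = \sup_k \mathbb{E}[1\{\mathcal{K}_{k,t}\} \|\Delta_k\|^2]$, such that for $l \in \mathbb{N}$ with $l \le k$, we have

$$\mathbb{E}\big[V(\tilde\Delta_{k+1})\, 1\{\mathcal{E}_l^k, \mathcal{K}_{k+1,t}\}\big] \le \mathbb{E}\big[(1 - c \alpha_k + C \alpha_k^2 k^\rho)\, V(\tilde\Delta_k)\, 1\{\mathcal{E}_l^{k-1}, \mathcal{K}_{k,t}\}\big] + C t^2 \alpha_k^2 k^\rho (C_{\Delta,t} + 1). \qquad (35)$$

We temporarily defer proof of this claim and show how to use it to show $\sum_{k=1}^n \|\tilde\Delta_k\|^{1+\gamma} = o_P(\sqrt{n})$.

Let $K < \infty$ be large enough that for the constants $c, C$ in inequality (35), there is a constant $c'$
such that for $k \ge K$ and stepsizes $\alpha_k = \alpha k^{-\beta}$, we have

$$(1 - c \alpha_i + C \alpha_i^2 i^\rho) \le \exp(-c' \alpha_i)$$

for $i \ge k/2$. This must be possible as we have assumed $\rho < \beta$. Then by recursively applying
inequality (35), we have for $k \ge K$,

$$\begin{aligned}
\mathbb{E}\big[V(\tilde\Delta_k)\, 1\{\mathcal{E}_{\lceil k/2 \rceil}^{k-1}, \mathcal{K}_{k,t}\}\big]
&\le \exp\Big(-c' \sum_{i=\lceil k/2 \rceil}^{k-1} \alpha_i\Big)\, \mathbb{E}\big[V(\tilde\Delta_{\lceil k/2 \rceil})\, 1\{\mathcal{K}_{\lceil k/2 \rceil, t}\}\big] \\
&\qquad + C (C_{\Delta,t} + 1)\, t^2 \sum_{i=\lceil k/2 \rceil}^{k-1} \alpha_i^2 i^\rho \exp\Big(-c' \sum_{j=i+1}^{k-1} \alpha_j\Big) \\
&\le \exp\big(-c k^{1-\beta}\big)\, \mathbb{E}\big[V(\tilde\Delta_{\lceil k/2 \rceil})\, 1\{\mathcal{K}_{\lceil k/2 \rceil, t}\}\big] + C \sum_{i=\lceil k/2 \rceil}^{k-1} \alpha_i^2 i^\rho \exp\big(-c (k^{1-\beta} - i^{1-\beta})\big), \qquad (36)
\end{aligned}$$

where the second inequality follows because $\sum_{j=i}^k \alpha_j \asymp k^{1-\beta} - i^{1-\beta}$, and $k^{1-\beta} - (k/2)^{1-\beta} \gtrsim k^{1-\beta}$.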

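Now, by using the assumption that $\mathbb{E}[1\{\mathcal{K}_{k,t}\} V(\tilde\Delta_k)] \lesssim \mathbb{E}[1\{\mathcal{K}_{k,t}\} \|\tilde\Delta_k\|^2] \lesssim C_{\Delta,t}$ for all $k$, we
find that

$$\begin{aligned}
\sum_{k=1}^n \mathbb{E}\big[\|\tilde\Delta_k\|^{1+\gamma}\, 1\{\mathcal{K}_{k,t}, \mathcal{E}_{\lceil k/2 \rceil}^{k-1}\}\big]
&\lesssim K C_{\Delta,t}^{\frac{1+\gamma}{2}} + \sum_{k \ge K} \mathbb{E}\big[\|\tilde\Delta_k\|^2\, 1\{\mathcal{K}_{k,t}, \mathcal{E}_{\lceil k/2 \rceil}^{k-1}\}\big]^{\frac{1+\gamma}{2}} \\
&\stackrel{(i)}{\lesssim} K C_{\Delta,t}^{\frac{1+\gamma}{2}} + \sum_{k \ge K} \Big( \exp\big(-c k^{1-\beta}\big) + C \sum_{i=\lceil k/2 \rceil}^{k} \alpha_i^2 i^\rho \exp\big(-c (k^{1-\beta} - i^{1-\beta})\big) \Big)^{\frac{1+\gamma}{2}} \\
&\lesssim K C_{\Delta,t}^{\frac{1+\gamma}{2}} + \sum_{k \ge K} \exp\Big(-\frac{c (1+\gamma)}{2} k^{1-\beta}\Big) + C \sum_{k \ge K} \Big( k^{\rho - 2\beta} \sum_{i=\lceil k/2 \rceil}^{k} \exp\big(-c (k^{1-\beta} - i^{1-\beta})\big) \Big)^{\frac{1+\gamma}{2}},
\end{aligned}$$

where step (i) follows from inequality (36). By Lemma 13, we know that

$$\sum_{i=\lceil k/2 \rceil}^{k} \exp\big(-c (k^{1-\beta} - i^{1-\beta})\big) \lesssim \Gamma\Big(\frac{1}{1-\beta}\Big) + k^\beta,$$

so that

$$\sum_{k=1}^n \mathbb{E}\big[\|\tilde\Delta_k\|^{1+\gamma}\, 1\{\mathcal{E}_{\lceil k/2 \rceil}^{k-1}, \mathcal{K}_{k,t}\}\big] \lesssim K C_{\Delta,t}^{\frac{1+\gamma}{2}} + C \sum_{k=K}^n \exp\big(-c k^{1-\beta}\big) + C \sum_{k=K}^n k^{-\frac{(\beta - \rho)(1+\gamma)}{2}}.$$

Noting that $\sum_{k=1}^n k^{-\frac{(\beta-\rho)(1+\gamma)}{2}} \asymp n^{1 - \frac{(\beta-\rho)(1+\gamma)}{2}}$, we have

$$\frac{1}{\sqrt{n}} \sum_{k=1}^n \mathbb{E}\big[\|\tilde\Delta_k\|^{1+\gamma}\, 1\{\mathcal{E}_{\lceil k/2 \rceil}^{k-1}, \mathcal{K}_{k,t}\}\big] \to 0 \quad \text{if} \quad (\beta - \rho)\frac{1+\gamma}{2} > \frac{1}{2}, \quad \text{or} \quad \rho < \beta - \frac{1}{1+\gamma}. \qquad (37)$$

Our initial choice of $\rho$ satisfied this inequality, so we have that for any $t \in \mathbb{R}$ the preceding
convergence guarantee holds.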

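Now, let $\delta > 0$ be arbitrary and, using Lemma 4, choose $t$ large enough that $\mathbb{P}(\mathcal{K}_{\infty,t}^c) \le \delta$. Then

$$\begin{aligned}
\mathbb{P}\Big(n^{-\frac{1}{2}} \sum_{k=1}^n \|\tilde\Delta_k\|^{1+\gamma}\, 1\{\mathcal{E}_{\lceil k/2 \rceil}^{k-1}\} > \epsilon\Big)
&\le \mathbb{P}\Big(\mathcal{K}_{\infty,t},\ n^{-\frac{1}{2}} \sum_{k=1}^n \|\tilde\Delta_k\|^{1+\gamma}\, 1\{\mathcal{E}_{\lceil k/2 \rceil}^{k-1}\} > \epsilon\Big) + \mathbb{P}(\mathcal{K}_{\infty,t}^c) \\
&\le \underbrace{\mathbb{P}\Big(n^{-\frac{1}{2}} \sum_{k=1}^n \|\tilde\Delta_k\|^{1+\gamma}\, 1\{\mathcal{E}_{\lceil k/2 \rceil}^{k-1}, \mathcal{K}_{k,t}\} > \epsilon\Big)}_{\to\, 0 \text{ as } n \to \infty} + \delta,
\end{aligned}$$

the convergence to zero a consequence of inequality (37). As $\delta > 0$ was arbitrary, we see that
$n^{-\frac{1}{2}} \sum_{k=1}^n \|\tilde\Delta_k\|^{1+\gamma}\, 1\{\mathcal{E}_{\lceil k/2 \rceil}^{k-1}\} \stackrel{p}{\to} 0$, and expression (34) gives the lemma.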


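Proof of inequality (35). Recall the definition of the event $\mathcal{E}_l^k = \{\|\tilde\Delta_i\| \le \epsilon, \text{ all } i = l, \ldots, k\}$
and that $1\{\mathcal{K}_{k+1,t}\} \le 1\{\mathcal{K}_{k,t}\}$. By inequality (25b) and the fact that $\mathcal{E}_l^i \in \mathcal{F}_{k-1}$ for any $i \le k$ and
$\mathcal{K}_{k,t} \in \mathcal{F}_{k-1}$, we have

$$\begin{aligned}
\mathbb{E}\big[V(\tilde\Delta_{k+1})\, 1\{\mathcal{E}_l^k, \mathcal{K}_{k+1,t}\}\big]
&\le \mathbb{E}\big[\mathbb{E}[V(\tilde\Delta_{k+1}) \mid \mathcal{F}_{k-1}]\, 1\{\mathcal{E}_l^k, \mathcal{K}_{k,t}\}\big] \\
&\le \mathbb{E}\Big[\big(V(\tilde\Delta_k) - \alpha_k \langle \nabla V(\tilde\Delta_k), R(\tilde x_k) \rangle + C \alpha_k^2 k^\rho V(\tilde\Delta_k)\big)\, 1\{\mathcal{E}_l^k, \mathcal{K}_{k,t}\}\Big] \\
&\qquad + C\, \mathbb{E}\Big[\alpha_k^2 K_k^2 \sum_{i=k-k^\rho}^{k-1} \|R(x_i) + \xi_i\|^2\, 1\{\mathcal{E}_l^{k-1}, \mathcal{K}_{k,t}\} + \alpha_k^2 (\|\Delta_k\|^2 + 1)\, 1\{\mathcal{K}_{k,t}, \mathcal{E}_l^{k-1}\}\Big],
\end{aligned}$$

where we have used Assumption E that $\xi_k = \xi_k(0) + \zeta_k(x_k)$, and $\mathbb{E}[\sup_k \mathbb{E}[\|\xi_k(0)\|^2 \mid \mathcal{F}_{k-1}]] < \infty$.

We now use Lemma 12 to provide control of the preceding inequality. Under Assumption F
coupled with $\sup_k \mathbb{E}[1\{\mathcal{K}_{k,t}\} \|\Delta_k\|^2] \le C_{\Delta,t} < \infty$, Lemma 12 implies that

$$\mathbb{E}\Big[\alpha_k^2 K_k^2 \sum_{i=k-k^\rho}^{k-1} \|R(x_i) + \xi_i\|^2\, 1\{\mathcal{E}_l^{k-1}, \mathcal{K}_{k,t}\}\Big] \lesssim \alpha_k^2 t^2 k^\rho (C_{\Delta,t} + 1).$$

Noting that $\|\tilde\Delta_k\|^2 \le C V(\tilde\Delta_k)$ and $\langle \nabla V(\tilde\Delta_k), R(\tilde x_k) \rangle \ge \lambda_0 V(\tilde\Delta_k)$ by Assumption F, we obtain
inequality (35) as desired.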
A.3.2 Proof of Lemma [8]

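As in the proof of Lemma 7, we show that expression (34) holds. As before, $\rho$ satisfies
assumption (16) and is the power used in the definition of $K_n$. We begin by considering the progress
made by a single step of the iteration with alternate error terms. We first claim, similar to
inequality (35), that there exist constants $c, C$ such that for any $l \le k$ with $l \in \mathbb{N}$, we have

$$\mathbb{E}\big[V(\tilde\Delta_{k+1})\, 1\{\mathcal{E}_l^k, \mathcal{K}_{k+1,t}\}\big] \le (1 - c \alpha_k + C \alpha_k^2 k^\rho)\, \mathbb{E}\big[V(\tilde\Delta_k)\, 1\{\mathcal{E}_l^{k-1}, \mathcal{K}_{k,t}\}\big] + C \alpha_k^2 t^2 k^\rho. \qquad (38)$$

Indeed, as in the proof of inequality (35), we use inequality (25a) to obtain

$$\begin{aligned}
\mathbb{E}\big[V(\tilde\Delta_{k+1})\, 1\{\mathcal{E}_l^k, \mathcal{K}_{k+1,t}\}\big]
&\le \mathbb{E}\big[\mathbb{E}[V(\tilde\Delta_{k+1}) \mid \mathcal{F}_{k-1}]\, 1\{\mathcal{E}_l^k, \mathcal{K}_{k,t}\}\big] \\
&\le \mathbb{E}\Big[\big(V(\tilde\Delta_k) - \alpha_k \langle \nabla V(\tilde\Delta_k), R(\tilde x_k) \rangle + C \alpha_k^2 k^\rho V(\tilde\Delta_k)\big)\, 1\{\mathcal{E}_l^k, \mathcal{K}_{k,t}\}\Big] \\
&\qquad + C\, \mathbb{E}\Big[\alpha_k^2 K_k^2 \sum_{i=k-k^\rho}^{k-1} \|R(x_i) + \xi_i\|^2\, 1\{\mathcal{E}_l^{k-1}, \mathcal{K}_{k,t}\} + \alpha_k^2 (\|R(x_k)\|^2 + 1)\, 1\{\mathcal{K}_{k,t}, \mathcal{E}_l^{k-1}\}\Big],
\end{aligned}$$

where we have used Assumption F' that $\mathbb{E}[\|\xi_k\|^2 \mid \mathcal{F}_{k-1}] \lesssim 1$ for all $k$. Using our assumption that on
the event $\mathcal{E}_l^k$ we have $\langle \nabla V(\tilde\Delta_k), R(\tilde x_k) \rangle \ge \lambda_0 V(\tilde\Delta_k)$ and that $\|R(x)\| \lesssim 1$ for all $x$ (Assumption F'),
we obtain the desired inequality (38).

Again paralleling the proof of Lemma 7, let $K < \infty$ be large enough that for the constants $c, C$
in inequality (38), there is a constant $c'$ such that for $k \ge K$ and stepsizes $\alpha_k = \alpha k^{-\beta}$, we have
$(1 - c \alpha_i + C \alpha_i^2 i^\rho) \le \exp(-c' \alpha_i)$ for $i \ge k/2$. Then by recursively applying inequality (38), we have
for $k \ge K$,

$$\mathbb{E}\big[V(\tilde\Delta_k)\, 1\{\mathcal{E}_{\lceil k/2 \rceil}^{k-1}, \mathcal{K}_{k,t}\}\big] \le \exp\big(-c' k^{1-\beta}\big)\, \mathbb{E}\big[V(\tilde\Delta_{\lceil k/2 \rceil})\, 1\{\mathcal{K}_{\lceil k/2 \rceil, t}\}\big] + C \sum_{i=\lceil k/2 \rceil}^{k-1} \alpha_i^2 i^\rho \exp\Big(-c' \sum_{j=i+1}^{k-1} \alpha_j\Big), \qquad (39)$$

exactly as in the derivation of inequality (36).

The remainder of the proof is completely identical to that of Lemma 7.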


A.4 Proof of Lemma [11]

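Fix $n \in \mathbb{N}$ and recall our assumption (16) on $\rho$ (in particular, $\rho < \beta - \frac{1}{2}$), where $\rho$ is used in the
definition of the ratio $K_n$ and event $\mathcal{K}_{n,t}$ (Defs. (17) and (18)), and consider $\mathbb{E}[V(\tilde\Delta_{n+1}) 1\{\mathcal{K}_{n+1,t}\}]$.
Combining inequality (25b), the fact that $1\{\mathcal{K}_{n,t}\}$ is non-increasing (because $\mathcal{K}_{n+1,t} \subset \mathcal{K}_{n,t}$), and
$\mathcal{K}_{n,t} \in \mathcal{F}_{n-1}$, we see that defining

$$C_{\Delta,n,t} = \max_{k \le n} \mathbb{E}[\|\Delta_k\|^2 1\{\mathcal{K}_{k,t}\}],$$

we have

$$\begin{aligned}
\mathbb{E}\big[V(\tilde\Delta_{n+1})\, 1\{\mathcal{K}_{n+1,t}\}\big] &\le \mathbb{E}\big[V(\tilde\Delta_{n+1})\, 1\{\mathcal{K}_{n,t}\}\big] \\
&\le \mathbb{E}\Big[\big(V(\tilde\Delta_n) - \alpha_n \langle \nabla V(\tilde\Delta_n), R(\tilde x_n) \rangle + C \alpha_n^2 n^\rho \|\nabla V(\tilde\Delta_n)\|^2\big)\, 1\{\mathcal{K}_{n,t}\}\Big] + C t^2 \alpha_n^2 n^\rho (C_{\Delta,n,t} + 1).
\end{aligned}$$

For the final inequality we have used inequality (29), that is, $\mathbb{E}[\|g_k\|^2 1\{\mathcal{K}_{k,t}\}] \le c\,(C_{\Delta,n,t} + 1)$. By
using that $\|\nabla V(x - x^*)\|^2 \le L^2 \|x - x^*\|^2 \le \frac{L^2}{\lambda} V(x - x^*)$ by our assumptions on $V$, we thus obtain

$$\mathbb{E}\big[V(\tilde\Delta_{n+1})\, 1\{\mathcal{K}_{n+1,t}\}\big] \le \mathbb{E}\Big[\big(V(\tilde\Delta_n) - \alpha_n \langle \nabla V(\tilde\Delta_n), R(\tilde x_n) \rangle + C \alpha_n^2 n^\rho V(\tilde\Delta_n)\big)\, 1\{\mathcal{K}_{n,t}\}\Big] + C t^2 \alpha_n^2 n^\rho (C_{\Delta,n,t} + 1).$$

Now, noting that $\langle \nabla V(\tilde\Delta_n), R(\tilde x_n) \rangle \ge \lambda_0 V(\tilde\Delta_n)$ by Assumption F, we have

$$\mathbb{E}\big[V(\tilde\Delta_{n+1})\, 1\{\mathcal{K}_{n+1,t}\}\big] \le (1 - \lambda_0 \alpha_n + C \alpha_n^2 n^\rho)\, \mathbb{E}\big[V(\tilde\Delta_n)\, 1\{\mathcal{K}_{n,t}\}\big] + C t^2 \alpha_n^2 n^\rho (C_{\Delta,n,t} + 1).$$

By recursively applying this inequality, we have

$$\mathbb{E}\big[V(\tilde\Delta_{n+1})\, 1\{\mathcal{K}_{n+1,t}\}\big] \le \prod_{k=1}^n (1 - \lambda_0 \alpha_k + C \alpha_k^2 k^\rho)\, \mathbb{E}[V(\tilde\Delta_1)] + C t^2 (C_{\Delta,n,t} + 1) \sum_{k=1}^n \alpha_k^2 k^\rho \prod_{l=k+1}^n (1 - \lambda_0 \alpha_l + C \alpha_l^2 l^\rho). \qquad (40)$$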

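We state a technical lemma that controls the products above. Let

$$b_l^k := \prod_{i=l}^k (1 - \lambda_0 \alpha_i + C \alpha_i^2 i^\rho).$$

Lemma 16. Let the scalar sequence $b_l^k$ be defined as above with $l \le k$, $\beta > \rho$, and $\alpha_i = \alpha i^{-\beta}$,
where $C$ is large enough relative to $\lambda_0$ that each term in the product is non-negative. There exist
constants $c_0, c_1, c_2$ (dependent on $\beta$, $\alpha$, $\lambda_0$, and $C$) such that

$$b_l^k \le c_0 \exp\Big(-c_1 \sum_{i=l}^k \alpha_i\Big) \le c_0 \exp\big(-c_2 (k^{1-\beta} - l^{1-\beta})\big). \qquad (41)$$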

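Proof. This result is similar to a result of Polyak and Juditsky [26, proof of Lemma 1, Part
3], but with some differences for additional powers in the sequence. As $\alpha_i = \alpha i^{-\beta}$ and we have
assumed that $\beta > \rho$, there exists some $K \in \mathbb{N}$ such that for $k \ge K$ we have $2 C \alpha_k^2 k^\rho \le \lambda_0 \alpha_k$, or
$\alpha k^{\rho - \beta} \le \lambda_0 / (2C)$. For any $k \ge K$, we have $(1 - \lambda_0 \alpha_k + C \alpha_k^2 k^\rho) \le 1 - \lambda_0 \alpha_k / 2 \le \exp(-\frac{\lambda_0}{2} \alpha_k)$. We
find that

$$\begin{aligned}
b_l^k = \prod_{i=l}^k (1 - \lambda_0 \alpha_i + C \alpha_i^2 i^\rho)
&= \prod_{i=l}^{(l \vee K) - 1} (1 - \lambda_0 \alpha_i + C \alpha_i^2 i^\rho) \prod_{i \ge l \vee K}^{k} (1 - \lambda_0 \alpha_i + C \alpha_i^2 i^\rho) \\
&\le \prod_{i=1}^K \max\{1, 1 - \lambda_0 \alpha_i + C \alpha_i^2 i^\rho\} \exp\Big(-\frac{\lambda_0}{2} \sum_{i \ge l \vee K}^{k} \alpha_i\Big) \\
&\le \Big[\exp\Big(\frac{\lambda_0}{2} \sum_{i=1}^K \alpha_i\Big) \prod_{i=1}^K \max\{1, 1 - \lambda_0 \alpha_i + C \alpha_i^2 i^\rho\}\Big] \exp\Big(-\frac{\lambda_0}{2} \sum_{i=l}^{k} \alpha_i\Big).
\end{aligned}$$

The term in the brackets $[\cdot]$ in the preceding product is the constant $c_0$, giving the first inequality of
expression (41). For the second, note that since $u \mapsto u^{-\beta}$ is decreasing,

$$\sum_{i=l}^k \alpha_i \ge \alpha \int_l^{k+1} u^{-\beta}\, du = \frac{\alpha}{1-\beta} \big((k+1)^{1-\beta} - l^{1-\beta}\big) \ge \frac{\alpha}{1-\beta} \big(k^{1-\beta} - l^{1-\beta}\big),$$

which completes the proof. □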
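The two steps of the proof can be checked numerically for indices past the threshold $K$. The sketch below is an illustration only: the parameter values and function names are hypothetical, and the explicit constant $\frac{\lambda_0 \alpha}{2(1-\beta)}$ is the one obtained by combining $1 - \lambda_0 \alpha_i / 2 \le \exp(-\lambda_0 \alpha_i / 2)$ with the integral comparison for $l \ge K$ (so that $c_0 = 1$ on this range). It is not claimed to be the constant $c_2$ of the lemma in general.

```python
import math


def log_product(l, k, lam0, big_c, alpha, beta, rho):
    """log of b_l^k = prod_{i=l}^k (1 - lam0*a_i + big_c*a_i^2*i^rho), a_i = alpha*i^(-beta)."""
    total = 0.0
    for i in range(l, k + 1):
        a_i = alpha * i ** (-beta)
        total += math.log(1.0 - lam0 * a_i + big_c * a_i ** 2 * i ** rho)
    return total


def threshold_index(lam0, big_c, alpha, beta, rho):
    """Smallest integer K with 2*big_c*alpha*K^(rho - beta) <= lam0 (requires beta > rho)."""
    return math.ceil((2.0 * big_c * alpha / lam0) ** (1.0 / (beta - rho)))


def log_bound(l, k, lam0, alpha, beta):
    """log of exp(-(lam0*alpha / (2*(1 - beta))) * ((k + 1)^(1-beta) - l^(1-beta)))."""
    rate = lam0 * alpha / (2.0 * (1.0 - beta))
    return -rate * ((k + 1) ** (1.0 - beta) - l ** (1.0 - beta))


if __name__ == "__main__":
    lam0, big_c, alpha, beta, rho = 1.0, 1.0, 1.0, 0.75, 0.2  # illustrative values
    K = threshold_index(lam0, big_c, alpha, beta, rho)
    for k in (100, 1000, 5000):
        print(K, k, log_product(K, k, lam0, big_c, alpha, beta, rho), log_bound(K, k, lam0, alpha, beta))
```

For every $l \ge K$ the logarithm of the product lies below the logarithm of the exponential bound, mirroring the second inequality in (41).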


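Applying Lemma 16 in inequality (40), we obtain for constants $c, C$ independent of $C_{\Delta,n,t}$ that

$$\mathbb{E}\big[V(\tilde\Delta_{n+1})\, 1\{\mathcal{K}_{n+1,t}\}\big] \le C \exp\big(-c n^{1-\beta}\big)\, \mathbb{E}[V(\tilde\Delta_1)] + C t^2 (C_{\Delta,n,t} + 1) \sum_{k=1}^n \alpha_k^2 k^\rho \exp\big(-c (n^{1-\beta} - k^{1-\beta})\big).$$

Now we use the technical Lemma 14, which shows that the final sum tends to zero when $\alpha_k = \alpha k^{-\beta}$.
In particular, for any $\epsilon > 0$, there exists some $N(\epsilon, t) < \infty$, independent of $C_{\Delta,n,t}$, such that
$n \ge N(\epsilon, t)$ implies

$$C \exp\big(-c n^{1-\beta}\big) \le \epsilon^2 \quad \text{and} \quad C t^2 \alpha^2 \sum_{k=1}^n k^{\rho - 2\beta} \exp\big(-c (n^{1-\beta} - k^{1-\beta})\big) \le \epsilon^2.$$

That is, we have

$$\mathbb{E}\big[V(\tilde\Delta_{n+1})\, 1\{\mathcal{K}_{n+1,t}\}\big] \le \epsilon^2 \big(\mathbb{E}[V(\tilde\Delta_1)] + C_{\Delta,n,t} + 1\big).$$

As $\epsilon > 0$ was arbitrary and there are constants $c, C$ such that $c \|x - x^*\|^2 \le V(x - x^*) \le C \|x - x^*\|^2$,
this gives Lemma 11.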


A.5 Proof of Lemma [13]

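We prove the result via a change of variables. Let $u = c\,(t^\kappa - a^\kappa)$, so that

$$t = (u/c + a^\kappa)^{\frac{1}{\kappa}}, \quad du = \kappa c\, t^{\kappa - 1}\, dt = \kappa c\, (u/c + a^\kappa)^{\frac{\kappa - 1}{\kappa}}\, dt, \quad \text{or} \quad dt = (\kappa c)^{-1} (u/c + a^\kappa)^{\frac{1-\kappa}{\kappa}}\, du.$$

That is, by our change of variables, we have

$$\begin{aligned}
\int_a^b \exp\big(-c\,(t^\kappa - a^\kappa)\big)\, dt &= \frac{1}{\kappa c} \int_0^{c (b^\kappa - a^\kappa)} \Big(\frac{u}{c} + a^\kappa\Big)^{\frac{1-\kappa}{\kappa}} e^{-u}\, du \\
&\le \frac{\max\{2^{\frac{1-2\kappa}{\kappa}}, 1\}}{\kappa c} \Big[ \int_0^{c (b^\kappa - a^\kappa)} \Big(\frac{u}{c}\Big)^{\frac{1-\kappa}{\kappa}} e^{-u}\, du + \int_0^{c (b^\kappa - a^\kappa)} a^{1-\kappa} e^{-u}\, du \Big],
\end{aligned}$$

where the final inequality follows by convexity of $t \mapsto t^{\frac{1-\kappa}{\kappa}}$ for $\kappa \le \frac{1}{2}$ and the fact that
$(t_1 + t_2)^{\frac{1-\kappa}{\kappa}} \le t_1^{\frac{1-\kappa}{\kappa}} + t_2^{\frac{1-\kappa}{\kappa}}$ for $\kappa \ge \frac{1}{2}$ (or $\frac{1-\kappa}{\kappa} \le 1$). Noting that
$\int_0^\infty u^{\frac{1-\kappa}{\kappa}} e^{-u}\, du = \Gamma(\frac{1}{\kappa})$ and $\int_0^\infty e^{-u}\, du = 1$, we obtain our desired result.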
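As a numerical sanity check of Lemma 13, the following minimal sketch compares a midpoint-rule approximation of $\int_a^b \exp(-c\,(t^\kappa - a^\kappa))\, dt$ with the bound $\frac{\max\{2^{(1-2\kappa)/\kappa}, 1\}}{\kappa c} [c^{-(1-\kappa)/\kappa}\, \Gamma(1/\kappa) + a^{1-\kappa}]$ obtained from the change of variables above. The function names, the quadrature rule, and the parameter values are illustrative assumptions, not part of the proof.

```python
import math


def lemma13_bound(c: float, kappa: float, a: float) -> float:
    """Bound max{2^((1-2k)/k), 1} / (k*c) * [c^(-(1-k)/k) * Gamma(1/k) + a^(1-k)], k = kappa."""
    p = (1.0 - kappa) / kappa  # exponent produced by the change of variables
    prefactor = max(2.0 ** (p - 1.0), 1.0) / (kappa * c)
    return prefactor * (c ** (-p) * math.gamma(1.0 / kappa) + a ** (1.0 - kappa))


def integral_midpoint(c: float, kappa: float, a: float, b: float, steps: int = 200_000) -> float:
    """Midpoint-rule approximation of int_a^b exp(-c * (t^kappa - a^kappa)) dt."""
    width = (b - a) / steps
    shift = a ** kappa
    total = 0.0
    for j in range(steps):
        t = a + (j + 0.5) * width
        total += math.exp(-c * (t ** kappa - shift))
    return total * width


if __name__ == "__main__":
    for c, kappa, a, b in ((1.0, 0.5, 0.0, 50.0), (2.0, 0.3, 1.0, 40.0), (0.5, 0.8, 2.0, 60.0)):
        print(c, kappa, a, b, integral_midpoint(c, kappa, a, b), lemma13_bound(c, kappa, a))
```

In the case $c = 1$, $\kappa = \frac{1}{2}$, $a = 0$ the bound equals $2 = \int_0^\infty e^{-\sqrt{t}}\, dt$, so the inequality is tight as $b \to \infty$.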


A.6 Proof of Lemma [14]

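The quantity in the summation diverges or converges identically to the integral

$$\int_1^n u^{\rho - 2\beta} \exp\big(-c (n^{1-\beta} - u^{1-\beta})\big)\, du \le \int_0^{an} \exp\big(-c (n^{1-\beta} - u^{1-\beta})\big)\, du + \int_{an}^n u^{\rho - 2\beta}\, du \qquad (42)$$

for any $a \in (0,1)$ with $an \ge 1$. Now, by concavity of $u \mapsto u^{1-\beta}$ for $\beta \in (\frac{1}{2}, 1)$, we have
$u^{1-\beta} \le n^{1-\beta} + (1-\beta) n^{-\beta} (u - n)$, or $n^{1-\beta} - u^{1-\beta} \ge (1-\beta)(n - u) n^{-\beta}$. In particular, the first integral on the
right side of the display (42) has bound

$$\begin{aligned}
\int_0^{an} \exp\big(-c (n^{1-\beta} - u^{1-\beta})\big)\, du &\le \int_0^{an} \exp\Big(-\frac{c (1-\beta)}{n^\beta} (n - u)\Big)\, du = \frac{n^\beta}{c (1-\beta)} \int_{c (1-\beta)(1-a) n^{1-\beta}}^{c (1-\beta) n^{1-\beta}} e^{-u}\, du \\
&= \frac{n^\beta}{c (1-\beta)} \Big[\exp\big(-c (1-\beta)(1-a) n^{1-\beta}\big) - \exp\big(-c (1-\beta) n^{1-\beta}\big)\Big] \\
&\le \frac{n^\beta}{c (1-\beta)} \exp\big(-c (1-\beta)(1-a) n^{1-\beta}\big), \qquad (43)
\end{aligned}$$

where we made a change of variables. For the second integral, we have

$$\int_{an}^n u^{\rho - 2\beta}\, du = \frac{1}{1 + \rho - 2\beta} \big(n^{1 + \rho - 2\beta} - (an)^{1 + \rho - 2\beta}\big),$$

and combining this with (43) in the bound (42), we obtain for any $a \in (0,1)$ that for constants
$C, c$,

$$\sum_{k=1}^n k^{\rho - 2\beta} \exp\big(-c (n^{1-\beta} - k^{1-\beta})\big) \le C \Big[ n^\beta \exp\big(-c (1-a) n^{1-\beta}\big) + \frac{n^{1 + \rho - 2\beta}}{1 + \rho - 2\beta} \big(1 - a^{1 + \rho - 2\beta}\big) \Big].$$

By our assumption that $\beta < 1$ and $a \in (0,1)$, the first term above converges to zero; our assumption
that $\rho < \beta - \frac{1}{2}$ (so that $1 + \rho - 2\beta < 0$) guarantees that the second does as well.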
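The decay asserted by Lemma 14 can also be observed directly. The sketch below is illustrative only: the function name and the values $\beta = 0.75$, $\rho = 0.2$, $c = 1$ (which satisfy $\rho < \beta - \frac{1}{2}$) are assumptions chosen for the example. It evaluates $\sum_{k=1}^n k^{\rho - 2\beta} \exp(-c (n^{1-\beta} - k^{1-\beta}))$ for increasing $n$.

```python
import math


def lemma14_sum(n: int, beta: float, rho: float, c: float) -> float:
    """Evaluate sum_{k=1}^n k^(rho - 2*beta) * exp(-c * (n^(1-beta) - k^(1-beta)))."""
    top = n ** (1.0 - beta)
    return sum(
        k ** (rho - 2.0 * beta) * math.exp(-c * (top - k ** (1.0 - beta)))
        for k in range(1, n + 1)
    )


if __name__ == "__main__":
    beta, rho, c = 0.75, 0.2, 1.0  # illustrative values with rho < beta - 1/2
    for n in (10**3, 10**4, 10**5):
        print(n, lemma14_sum(n, beta, rho, c))
```

The printed values decrease as $n$ grows, with most of the mass coming from the roughly $n^\beta$ indices $k$ closest to $n$, as the split at $an$ in (42) suggests.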